Load Data
setwd("~/Dropbox/Academia/Hawaii/Carlos_Thesis_Papers/Thesis/Chapters/scripts/data")
derby = read.csv("derby.csv", header = TRUE)
lucene = read.csv("lucene.csv", header = TRUE)
pdfbox = read.csv("pdfbox.csv", header = TRUE)
ivy = read.csv("ivy.csv", header = TRUE)
Derby, Lucene, Pdfbox, Ivy and Ftpserver are all projects of the Apache Software Foundation. The data for each project loaded above is stored in the variable of the same name.
The purpose of this analysis is to observe whether we can establish any relationship between the structural complexity of code files and the amount of effort taken to maintain them. Concretely, we operationalize structural complexity as a set of OO metrics, each of which measures an aspect of structural complexity. Effort is operationalized through the variables discussion, actions and churn.
Each observation is given in a row and can be read as follows: for every change to a file that addresses an issue in a given release, we calculate the file's structural complexity metrics and the associated effort. Concretely, each row is identified by the columns file, issue_code and release. How each effort estimator is mapped to a file metric is discussed in the paper.
Since this is a cross-sectional study, we must train and test our models at a given point in time. Since we only take measures per release, we consider the set of data points that belongs to each release as a potential training or test set. We must be careful, however, to check which releases can be used according to their size (some may lack enough data points to be used).
suppressMessages(library(plyr))
amountDataPointsPerRelease <- function(data) {
ddply(data, .(release), summarise, n = length(release))
}
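The per-release count can also be sketched in base R, which may help readers who do not have plyr installed. The data frame `toy` and the helper name `countPerRelease` are hypothetical and are not part of the original analysis.

```r
# Toy stand-in for one project's data; the real frames have many more columns.
toy <- data.frame(
  release = c("1.0", "1.0", "1.1", "1.1", "1.1", "2.0"),
  churn   = c(10, 3, 7, 1, 4, 12)
)

# Base R counterpart of ddply(data, .(release), summarise, n = length(release)):
# table() counts the rows per release, and the result is reshaped into a
# two-column data frame.
countPerRelease <- function(data) {
  counts <- table(data$release)
  data.frame(release = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

counts <- countPerRelease(toy)
print(counts)
```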
The number of data points per release in Derby, Lucene, Pdfbox and Ivy, respectively, is as follows:
amountDataPointsPerRelease(derby)
## release n
## 1 10.1.1.0 97
## 2 10.1.2.1 162
## 3 10.1.3.1 57
## 4 10.2.1.6 14
## 5 10.2.2.0 10
## 6 10.3.1.4 4
## 7 10.3.2.1 1
## 8 10.3.3.0 4
## 9 10.4.2.0 7
## 10 10.5.1.1 4
## 11 10.5.3.0 106
## 12 10.6.1.0 84
## 13 10.6.2.1 30
## 14 10.7.1.1 83
amountDataPointsPerRelease(lucene)
## release n
## 1 1.9.1 1
## 2 2.2 1
## 3 2.3 3
## 4 2.3.1 20
## 5 2.3.2 7
## 6 2.4 3
## 7 2.9 5
## 8 2.9.1 2
## 9 2.9.2 56
## 10 2.9.3 35
## 11 2.9.4 2
## 12 3.0 17
amountDataPointsPerRelease(pdfbox)
## release n
## 1 1.1.0 43
## 2 1.2.1 48
## 3 1.3.1 23
## 4 1.4.0 56
## 5 1.5.0 53
## 6 1.6.0 10
amountDataPointsPerRelease(ivy)
## release n
## 1 2.0 14
## 2 2.0-RC1 29
## 3 2.0-RC2 10
## 4 2.0.0-alpha-2 18
## 5 2.0.0-beta-1 25
## 6 2.0.0-beta-2 190
## 7 2.1.0 48
## 8 2.1.0-RC1 22
## 9 2.1.0-RC2 16
## 10 2.2.0 25
## 11 2.2.0-RC1 11
We conclude that some releases cannot be used for the analysis because they contain too few data points. We chose a threshold of at least 40 data points for a release to be considered as either a training or a test set.
filterReleases <- function(data, threshold) {
# Obtain the releases whose number of data points meets the threshold
data.perRelease = amountDataPointsPerRelease(data)
releases = data.perRelease[data.perRelease$n >= threshold, 1]
# Return only the data points that belong to releases meeting the threshold
data = data[data$release %in% releases, ]
data
}
derby = filterReleases(derby, 40)
lucene = filterReleases(lucene, 40)
pdfbox = filterReleases(pdfbox, 40)
ivy = filterReleases(ivy, 40)
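The filtering rule can be illustrated on a small example. This is a minimal sketch in base R; `toy` and `keepLargeReleases` are hypothetical names. It uses `>=` so that a release with exactly the threshold number of data points is kept, matching the "at least" wording of the rule.

```r
# Toy data: release 1.0 has 5 rows, 1.1 has 2 rows, 2.0 has 3 rows.
toy <- data.frame(
  release = rep(c("1.0", "1.1", "2.0"), times = c(5, 2, 3)),
  churn   = seq_len(10)
)

# Keep only the rows whose release has at least `threshold` data points.
keepLargeReleases <- function(data, threshold) {
  counts <- table(data$release)
  keep <- names(counts)[counts >= threshold]
  data[data$release %in% keep, , drop = FALSE]
}

filtered <- keepLargeReleases(toy, 3)
print(table(as.character(filtered$release)))
```

With a threshold of 3, release 1.1 is dropped while release 2.0, which sits exactly at the threshold, is retained.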
bla = ddply(derby, .(release, issue_code, discussion), summarise, churn_median = median(churn),
churn_mean = mean(churn), churn_sd = sd(churn), actions_max = max(actions),
mean_raw_loc = mean(raw_loc), mean_ckjm_dit = mean(ckjm_dit), mean_ckjm_ca = mean(ckjm_ca),
mean_ckjm_npm = mean(ckjm_npm), mean_ckjm_cbo = mean(ckjm_cbo), mean_ckjm_noc = mean(ckjm_noc),
mean_ckjm_rfc = mean(ckjm_rfc), mean_ckjm_lcom = mean(ckjm_lcom), mean_ckjm_wmc = mean(ckjm_wmc),
n = length(file))
derby.all = derby[, c(6, 7, 8, 9:17)]
lucene.all = lucene[, c(6, 7, 8, 9:17)]
pdfbox.all = pdfbox[, c(6, 7, 8, 9:17)]
ivy.all = ivy[, c(6, 7, 8, 9:17)]
# derby.means = ddply(derby, .(release,issue_code,discussion), summarise,
# churn_mean = mean(churn), actions_max = max(actions), mean_raw_loc =
# mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca =
# mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))
# Try to average the discussion instead of the files
derby.means = ddply(derby, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca,
ckjm_npm, ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise,
discussion_mean = mean(discussion), actions_mean = mean(actions))
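The aggregation step can be sketched in base R as follows. This is a simplified illustration: it groups by file and release only, whereas the call above also groups by churn and the metric columns. The toy data and its values are hypothetical.

```r
# Toy data: A.java was changed for two issues in release 1.0, B.java for one.
toy <- data.frame(
  file       = c("A.java", "A.java", "B.java"),
  release    = c("1.0", "1.0", "1.0"),
  discussion = c(2, 4, 5),
  actions    = c(1, 3, 2)
)

# Average the effort estimators over all issues that touched the same file
# in the same release.
toy.means <- aggregate(
  toy[, c("discussion", "actions")],
  by = list(file = toy$file, release = toy$release),
  FUN = mean
)
names(toy.means)[3:4] <- c("discussion_mean", "actions_mean")
print(toy.means)
```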
cor(derby.means[, c(3:14)], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.20956 0.10334 0.023104 0.04809 0.01525
## raw_loc 0.20956 1.00000 0.02667 0.313768 0.55569 0.68729
## ckjm_dit 0.10334 0.02667 1.00000 0.161639 -0.01732 -0.22633
## ckjm_ca 0.02310 0.31377 0.16164 1.000000 0.59146 0.26484
## ckjm_npm 0.04809 0.55569 -0.01732 0.591456 1.00000 0.51375
## ckjm_cbo 0.01525 0.68729 -0.22633 0.264843 0.51375 1.00000
## ckjm_noc 0.02167 0.27814 -0.11865 0.304152 0.36782 0.25031
## ckjm_rfc 0.19838 0.94201 0.01222 0.363450 0.64584 0.76921
## ckjm_lcom 0.12127 0.69143 0.14654 0.476083 0.67603 0.50141
## ckjm_wmc 0.18375 0.85035 0.10970 0.539586 0.78530 0.61761
## discussion_mean 0.61275 0.10808 0.03009 -0.002724 -0.01154 -0.01904
## actions_mean 0.43649 0.02382 0.12171 0.019092 -0.10656 -0.04692
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.021671 0.19838 0.12127 0.183748 0.612755
## raw_loc 0.278143 0.94201 0.69143 0.850348 0.108077
## ckjm_dit -0.118648 0.01222 0.14654 0.109699 0.030092
## ckjm_ca 0.304152 0.36345 0.47608 0.539586 -0.002724
## ckjm_npm 0.367824 0.64584 0.67603 0.785302 -0.011537
## ckjm_cbo 0.250315 0.76921 0.50141 0.617613 -0.019043
## ckjm_noc 1.000000 0.30474 0.32402 0.380496 0.008888
## ckjm_rfc 0.304740 1.00000 0.74000 0.904341 0.097768
## ckjm_lcom 0.324016 0.74000 1.00000 0.835038 0.074350
## ckjm_wmc 0.380496 0.90434 0.83504 1.000000 0.072536
## discussion_mean 0.008888 0.09777 0.07435 0.072536 1.000000
## actions_mean -0.037214 0.01259 -0.04319 -0.005744 0.319389
## actions_mean
## churn 0.436488
## raw_loc 0.023816
## ckjm_dit 0.121707
## ckjm_ca 0.019092
## ckjm_npm -0.106563
## ckjm_cbo -0.046918
## ckjm_noc -0.037214
## ckjm_rfc 0.012590
## ckjm_lcom -0.043187
## ckjm_wmc -0.005744
## discussion_mean 0.319389
## actions_mean 1.000000
cor(derby.all, method = "spearman")
## churn actions discussion raw_loc ckjm_dit ckjm_ca
## churn 1.00000 0.436488 0.612755 0.20956 0.10334 0.023104
## actions 0.43649 1.000000 0.319389 0.02382 0.12171 0.019092
## discussion 0.61275 0.319389 1.000000 0.10808 0.03009 -0.002724
## raw_loc 0.20956 0.023816 0.108077 1.00000 0.02667 0.313768
## ckjm_dit 0.10334 0.121707 0.030092 0.02667 1.00000 0.161639
## ckjm_ca 0.02310 0.019092 -0.002724 0.31377 0.16164 1.000000
## ckjm_npm 0.04809 -0.106563 -0.011537 0.55569 -0.01732 0.591456
## ckjm_cbo 0.01525 -0.046918 -0.019043 0.68729 -0.22633 0.264843
## ckjm_noc 0.02167 -0.037214 0.008888 0.27814 -0.11865 0.304152
## ckjm_rfc 0.19838 0.012590 0.097768 0.94201 0.01222 0.363450
## ckjm_lcom 0.12127 -0.043187 0.074350 0.69143 0.14654 0.476083
## ckjm_wmc 0.18375 -0.005744 0.072536 0.85035 0.10970 0.539586
## ckjm_npm ckjm_cbo ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## churn 0.04809 0.01525 0.021671 0.19838 0.12127 0.183748
## actions -0.10656 -0.04692 -0.037214 0.01259 -0.04319 -0.005744
## discussion -0.01154 -0.01904 0.008888 0.09777 0.07435 0.072536
## raw_loc 0.55569 0.68729 0.278143 0.94201 0.69143 0.850348
## ckjm_dit -0.01732 -0.22633 -0.118648 0.01222 0.14654 0.109699
## ckjm_ca 0.59146 0.26484 0.304152 0.36345 0.47608 0.539586
## ckjm_npm 1.00000 0.51375 0.367824 0.64584 0.67603 0.785302
## ckjm_cbo 0.51375 1.00000 0.250315 0.76921 0.50141 0.617613
## ckjm_noc 0.36782 0.25031 1.000000 0.30474 0.32402 0.380496
## ckjm_rfc 0.64584 0.76921 0.304740 1.00000 0.74000 0.904341
## ckjm_lcom 0.67603 0.50141 0.324016 0.74000 1.00000 0.835038
## ckjm_wmc 0.78530 0.61761 0.380496 0.90434 0.83504 1.000000
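As a reminder of what the matrices above measure: Spearman's coefficient is the Pearson correlation computed on the ranks of the values, so it captures monotonic rather than strictly linear association. A minimal sketch on synthetic data (the variables `x` and `y` are made up for illustration):

```r
set.seed(1)
x <- rexp(50)                    # skewed synthetic "metric"
y <- x^2 + rnorm(50, sd = 0.5)   # monotonic but nonlinear relationship plus noise

# Spearman correlation computed directly ...
rho_direct <- cor(x, y, method = "spearman")
# ... and as a Pearson correlation on the ranks.
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

print(c(direct = rho_direct, via_ranks = rho_ranks))
```

Both computations agree, which is why the coefficient is less sensitive than Pearson's to the heavy skew that is typical of size and churn measures.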
# lucene.means = ddply(lucene, .(release,issue_code,discussion),
# summarise, churn_mean = mean(churn), actions_max = max(actions),
# mean_raw_loc = mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca
# = mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))
lucene.means = ddply(lucene, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca,
ckjm_npm, ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise,
discussion_mean = mean(discussion), actions_mean = mean(actions))
cor(lucene.means[, c(3:14)], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 -0.01779 -0.02746 -0.07796 -0.12384 0.01204
## raw_loc -0.01779 1.00000 0.65722 0.63864 0.69413 0.85936
## ckjm_dit -0.02746 0.65722 1.00000 0.79861 0.44885 0.61369
## ckjm_ca -0.07796 0.63864 0.79861 1.00000 0.43759 0.57931
## ckjm_npm -0.12384 0.69413 0.44885 0.43759 1.00000 0.55044
## ckjm_cbo 0.01204 0.85936 0.61369 0.57931 0.55044 1.00000
## ckjm_noc 0.02890 0.15666 0.47376 0.56650 0.06137 0.31684
## ckjm_rfc -0.02734 0.96308 0.68829 0.62440 0.77516 0.89379
## ckjm_lcom -0.15162 0.80937 0.67958 0.74415 0.62512 0.66545
## ckjm_wmc -0.04474 0.94364 0.71369 0.70397 0.77584 0.81565
## discussion_mean 0.66407 0.02804 -0.04835 -0.09014 -0.01720 0.16206
## actions_mean 0.33773 -0.23778 -0.21412 -0.09185 -0.20546 -0.13125
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.02890 -0.02734 -0.1516 -0.04474 0.66407
## raw_loc 0.15666 0.96308 0.8094 0.94364 0.02804
## ckjm_dit 0.47376 0.68829 0.6796 0.71369 -0.04835
## ckjm_ca 0.56650 0.62440 0.7442 0.70397 -0.09014
## ckjm_npm 0.06137 0.77516 0.6251 0.77584 -0.01720
## ckjm_cbo 0.31684 0.89379 0.6655 0.81565 0.16206
## ckjm_noc 1.00000 0.16049 0.1983 0.22700 0.01291
## ckjm_rfc 0.16049 1.00000 0.7778 0.94026 0.06491
## ckjm_lcom 0.19827 0.77780 1.0000 0.84625 -0.12155
## ckjm_wmc 0.22700 0.94026 0.8462 1.00000 0.05218
## discussion_mean 0.01291 0.06491 -0.1216 0.05218 1.00000
## actions_mean -0.04588 -0.17863 -0.2566 -0.23744 0.10253
## actions_mean
## churn 0.33773
## raw_loc -0.23778
## ckjm_dit -0.21412
## ckjm_ca -0.09185
## ckjm_npm -0.20546
## ckjm_cbo -0.13125
## ckjm_noc -0.04588
## ckjm_rfc -0.17863
## ckjm_lcom -0.25664
## ckjm_wmc -0.23744
## discussion_mean 0.10253
## actions_mean 1.00000
cor(lucene.all, method = "spearman")
## churn actions discussion raw_loc ckjm_dit ckjm_ca
## churn 1.00000 0.33773 0.66407 -0.01779 -0.02746 -0.07796
## actions 0.33773 1.00000 0.10253 -0.23778 -0.21412 -0.09185
## discussion 0.66407 0.10253 1.00000 0.02804 -0.04835 -0.09014
## raw_loc -0.01779 -0.23778 0.02804 1.00000 0.65722 0.63864
## ckjm_dit -0.02746 -0.21412 -0.04835 0.65722 1.00000 0.79861
## ckjm_ca -0.07796 -0.09185 -0.09014 0.63864 0.79861 1.00000
## ckjm_npm -0.12384 -0.20546 -0.01720 0.69413 0.44885 0.43759
## ckjm_cbo 0.01204 -0.13125 0.16206 0.85936 0.61369 0.57931
## ckjm_noc 0.02890 -0.04588 0.01291 0.15666 0.47376 0.56650
## ckjm_rfc -0.02734 -0.17863 0.06491 0.96308 0.68829 0.62440
## ckjm_lcom -0.15162 -0.25664 -0.12155 0.80937 0.67958 0.74415
## ckjm_wmc -0.04474 -0.23744 0.05218 0.94364 0.71369 0.70397
## ckjm_npm ckjm_cbo ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## churn -0.12384 0.01204 0.02890 -0.02734 -0.1516 -0.04474
## actions -0.20546 -0.13125 -0.04588 -0.17863 -0.2566 -0.23744
## discussion -0.01720 0.16206 0.01291 0.06491 -0.1216 0.05218
## raw_loc 0.69413 0.85936 0.15666 0.96308 0.8094 0.94364
## ckjm_dit 0.44885 0.61369 0.47376 0.68829 0.6796 0.71369
## ckjm_ca 0.43759 0.57931 0.56650 0.62440 0.7442 0.70397
## ckjm_npm 1.00000 0.55044 0.06137 0.77516 0.6251 0.77584
## ckjm_cbo 0.55044 1.00000 0.31684 0.89379 0.6655 0.81565
## ckjm_noc 0.06137 0.31684 1.00000 0.16049 0.1983 0.22700
## ckjm_rfc 0.77516 0.89379 0.16049 1.00000 0.7778 0.94026
## ckjm_lcom 0.62512 0.66545 0.19827 0.77780 1.0000 0.84625
## ckjm_wmc 0.77584 0.81565 0.22700 0.94026 0.8462 1.00000
# ivy.means = ddply(ivy, .(release,issue_code,discussion), summarise,
# churn_mean = mean(churn), actions_max = max(actions), mean_raw_loc =
# mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca =
# mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))
ivy.means = ddply(ivy, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca, ckjm_npm,
ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise, discussion_mean = mean(discussion),
actions_mean = mean(actions))
cor(ivy.means[, c(3:14)], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.000000 -0.04043 0.006849 -0.08043 0.01725 -0.01943
## raw_loc -0.040433 1.00000 0.400096 0.31839 0.61566 0.78074
## ckjm_dit 0.006849 0.40010 1.000000 0.57474 0.30273 0.43413
## ckjm_ca -0.080427 0.31839 0.574738 1.00000 0.47016 0.36316
## ckjm_npm 0.017248 0.61566 0.302729 0.47016 1.00000 0.47624
## ckjm_cbo -0.019435 0.78074 0.434132 0.36316 0.47624 1.00000
## ckjm_noc 0.020330 -0.04828 -0.071066 0.03143 0.11215 -0.02191
## ckjm_rfc -0.017455 0.95029 0.450035 0.34946 0.60861 0.88113
## ckjm_lcom 0.011535 0.63499 0.202051 0.46169 0.86990 0.52923
## ckjm_wmc -0.002518 0.75340 0.382969 0.49444 0.94541 0.63855
## discussion_mean 0.593934 -0.14384 -0.107982 -0.11113 -0.06317 -0.13370
## actions_mean 0.226956 -0.13078 -0.010047 0.01550 -0.03301 -0.09987
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.020330 -0.01745 0.01153 -0.002518 0.59393
## raw_loc -0.048281 0.95029 0.63499 0.753398 -0.14384
## ckjm_dit -0.071066 0.45003 0.20205 0.382969 -0.10798
## ckjm_ca 0.031433 0.34946 0.46169 0.494441 -0.11113
## ckjm_npm 0.112151 0.60861 0.86990 0.945412 -0.06317
## ckjm_cbo -0.021907 0.88113 0.52923 0.638547 -0.13370
## ckjm_noc 1.000000 -0.03717 0.13117 0.117528 0.08169
## ckjm_rfc -0.037175 1.00000 0.61923 0.761252 -0.13262
## ckjm_lcom 0.131174 0.61923 1.00000 0.886539 -0.08457
## ckjm_wmc 0.117528 0.76125 0.88654 1.000000 -0.09615
## discussion_mean 0.081694 -0.13262 -0.08457 -0.096149 1.00000
## actions_mean -0.009676 -0.09706 -0.08366 -0.062500 0.20078
## actions_mean
## churn 0.226956
## raw_loc -0.130782
## ckjm_dit -0.010047
## ckjm_ca 0.015500
## ckjm_npm -0.033015
## ckjm_cbo -0.099873
## ckjm_noc -0.009676
## ckjm_rfc -0.097063
## ckjm_lcom -0.083659
## ckjm_wmc -0.062500
## discussion_mean 0.200775
## actions_mean 1.000000
cor(ivy.all, method = "spearman")
## churn actions discussion raw_loc ckjm_dit ckjm_ca
## churn 1.0000000 0.231230 0.60103 -0.04119 0.01088 -0.07864
## actions 0.2312298 1.000000 0.20458 -0.13002 -0.00880 0.01579
## discussion 0.6010334 0.204577 1.00000 -0.13951 -0.10279 -0.10908
## raw_loc -0.0411928 -0.130020 -0.13951 1.00000 0.39710 0.31914
## ckjm_dit 0.0108848 -0.008800 -0.10279 0.39710 1.00000 0.57293
## ckjm_ca -0.0786429 0.015791 -0.10908 0.31914 0.57293 1.00000
## ckjm_npm 0.0197779 -0.032689 -0.06123 0.61141 0.30135 0.47174
## ckjm_cbo -0.0222131 -0.098882 -0.12914 0.78285 0.42915 0.36078
## ckjm_noc 0.0266050 -0.007428 0.08690 -0.04762 -0.06928 0.03172
## ckjm_rfc -0.0197180 -0.096311 -0.12841 0.95058 0.44542 0.34798
## ckjm_lcom 0.0103519 -0.084503 -0.08560 0.63123 0.20079 0.46209
## ckjm_wmc 0.0004067 -0.061514 -0.09258 0.75081 0.38064 0.49576
## ckjm_npm ckjm_cbo ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## churn 0.01978 -0.02221 0.026605 -0.01972 0.01035 0.0004067
## actions -0.03269 -0.09888 -0.007428 -0.09631 -0.08450 -0.0615137
## discussion -0.06123 -0.12914 0.086902 -0.12841 -0.08560 -0.0925847
## raw_loc 0.61141 0.78285 -0.047620 0.95058 0.63123 0.7508100
## ckjm_dit 0.30135 0.42915 -0.069283 0.44542 0.20079 0.3806393
## ckjm_ca 0.47174 0.36078 0.031716 0.34798 0.46209 0.4957634
## ckjm_npm 1.00000 0.46786 0.112784 0.60029 0.87099 0.9454394
## ckjm_cbo 0.46786 1.00000 -0.021794 0.88246 0.52134 0.6326400
## ckjm_noc 0.11278 -0.02179 1.000000 -0.03722 0.12949 0.1187484
## ckjm_rfc 0.60029 0.88246 -0.037223 1.00000 0.61185 0.7550777
## ckjm_lcom 0.87099 0.52134 0.129492 0.61185 1.00000 0.8860091
## ckjm_wmc 0.94544 0.63264 0.118748 0.75508 0.88601 1.0000000
# pdfbox.means = ddply(pdfbox, .(release,issue_code,discussion),
# summarise, churn_mean = mean(churn), actions_max = max(actions),
# mean_raw_loc = mean(raw_loc),mean_ckjm_dit = mean(ckjm_dit),mean_ckjm_ca
# = mean(ckjm_ca),mean_ckjm_npm = mean(ckjm_npm),mean_ckjm_cbo =
# mean(ckjm_cbo),mean_ckjm_noc = mean(ckjm_noc),mean_ckjm_rfc =
# mean(ckjm_rfc),mean_ckjm_lcom = mean(ckjm_lcom),mean_ckjm_wmc =
# mean(ckjm_wmc))
pdfbox.means = ddply(pdfbox, .(file, release, churn, raw_loc, ckjm_dit, ckjm_ca,
ckjm_npm, ckjm_cbo, ckjm_noc, ckjm_rfc, ckjm_lcom, ckjm_wmc), summarise,
discussion_mean = mean(discussion), actions_mean = mean(actions))
cor(pdfbox.means[, c(3:14)], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.11951 -0.10151 0.11137 0.1699 0.12277
## raw_loc 0.11951 1.00000 0.35212 0.52607 0.5492 0.63545
## ckjm_dit -0.10151 0.35212 1.00000 0.12527 0.2075 0.18796
## ckjm_ca 0.11137 0.52607 0.12527 1.00000 0.7093 0.26250
## ckjm_npm 0.16986 0.54918 0.20746 0.70931 1.0000 0.45676
## ckjm_cbo 0.12277 0.63545 0.18796 0.26250 0.4568 1.00000
## ckjm_noc 0.13196 0.17424 0.08788 0.35122 0.4152 0.18975
## ckjm_rfc 0.13774 0.81559 0.35444 0.40835 0.6080 0.85472
## ckjm_lcom 0.09268 0.68738 0.15630 0.53201 0.5550 0.50405
## ckjm_wmc 0.11786 0.74572 0.37223 0.68009 0.9217 0.57000
## discussion_mean 0.44049 0.16298 -0.02520 0.09515 0.1602 0.12136
## actions_mean -0.08287 0.04366 0.15545 -0.07452 -0.1065 -0.04983
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.13196 0.13774 0.09268 0.11786 0.44049
## raw_loc 0.17424 0.81559 0.68738 0.74572 0.16298
## ckjm_dit 0.08788 0.35444 0.15630 0.37223 -0.02520
## ckjm_ca 0.35122 0.40835 0.53201 0.68009 0.09515
## ckjm_npm 0.41518 0.60803 0.55502 0.92173 0.16020
## ckjm_cbo 0.18975 0.85472 0.50405 0.57000 0.12136
## ckjm_noc 1.00000 0.27542 0.28975 0.36433 -0.04684
## ckjm_rfc 0.27542 1.00000 0.66605 0.76988 0.17440
## ckjm_lcom 0.28975 0.66605 1.00000 0.69034 0.09781
## ckjm_wmc 0.36433 0.76988 0.69034 1.00000 0.14754
## discussion_mean -0.04684 0.17440 0.09781 0.14754 1.00000
## actions_mean -0.08394 -0.01554 -0.02925 -0.04775 0.21832
## actions_mean
## churn -0.08287
## raw_loc 0.04366
## ckjm_dit 0.15545
## ckjm_ca -0.07452
## ckjm_npm -0.10649
## ckjm_cbo -0.04983
## ckjm_noc -0.08394
## ckjm_rfc -0.01554
## ckjm_lcom -0.02925
## ckjm_wmc -0.04775
## discussion_mean 0.21832
## actions_mean 1.00000
cor(pdfbox.all, method = "spearman")
## churn actions discussion raw_loc ckjm_dit ckjm_ca ckjm_npm
## churn 1.00000 -0.08287 0.44049 0.11951 -0.10151 0.11137 0.1699
## actions -0.08287 1.00000 0.21832 0.04366 0.15545 -0.07452 -0.1065
## discussion 0.44049 0.21832 1.00000 0.16298 -0.02520 0.09515 0.1602
## raw_loc 0.11951 0.04366 0.16298 1.00000 0.35212 0.52607 0.5492
## ckjm_dit -0.10151 0.15545 -0.02520 0.35212 1.00000 0.12527 0.2075
## ckjm_ca 0.11137 -0.07452 0.09515 0.52607 0.12527 1.00000 0.7093
## ckjm_npm 0.16986 -0.10649 0.16020 0.54918 0.20746 0.70931 1.0000
## ckjm_cbo 0.12277 -0.04983 0.12136 0.63545 0.18796 0.26250 0.4568
## ckjm_noc 0.13196 -0.08394 -0.04684 0.17424 0.08788 0.35122 0.4152
## ckjm_rfc 0.13774 -0.01554 0.17440 0.81559 0.35444 0.40835 0.6080
## ckjm_lcom 0.09268 -0.02925 0.09781 0.68738 0.15630 0.53201 0.5550
## ckjm_wmc 0.11786 -0.04775 0.14754 0.74572 0.37223 0.68009 0.9217
## ckjm_cbo ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## churn 0.12277 0.13196 0.13774 0.09268 0.11786
## actions -0.04983 -0.08394 -0.01554 -0.02925 -0.04775
## discussion 0.12136 -0.04684 0.17440 0.09781 0.14754
## raw_loc 0.63545 0.17424 0.81559 0.68738 0.74572
## ckjm_dit 0.18796 0.08788 0.35444 0.15630 0.37223
## ckjm_ca 0.26250 0.35122 0.40835 0.53201 0.68009
## ckjm_npm 0.45676 0.41518 0.60803 0.55502 0.92173
## ckjm_cbo 1.00000 0.18975 0.85472 0.50405 0.57000
## ckjm_noc 0.18975 1.00000 0.27542 0.28975 0.36433
## ckjm_rfc 0.85472 0.27542 1.00000 0.66605 0.76988
## ckjm_lcom 0.50405 0.28975 0.66605 1.00000 0.69034
## ckjm_wmc 0.57000 0.36433 0.76988 0.69034 1.00000
This leaves us with the following releases and the associated number of data points for each project:
amountDataPointsPerRelease(derby)
## release n
## 1 10.1.1.0 97
## 2 10.1.2.1 162
## 3 10.1.3.1 57
## 4 10.5.3.0 106
## 5 10.6.1.0 84
## 6 10.7.1.1 83
amountDataPointsPerRelease(lucene)
## release n
## 1 2.9.2 56
amountDataPointsPerRelease(pdfbox)
## release n
## 1 1.1.0 43
## 2 1.2.1 48
## 3 1.4.0 56
## 4 1.5.0 53
amountDataPointsPerRelease(ivy)
## release n
## 1 2.0.0-beta-2 190
## 2 2.1.0 48
For this analysis, since we measured 3 different effort estimators, we are interested in creating 3 models, one for each effort estimator and our chosen structural complexity file metrics.
# Project data for the churn effort estimator models
derby.churn = derby[, c(6, 9:17)]
lucene.churn = lucene[, c(6, 9:17)]
pdfbox.churn = pdfbox[, c(6, 9:17)]
ivy.churn = ivy[, c(6, 9:17)]
# Project data for the actions effort estimator models
derby.actions = derby[, c(7, 9:17)]
lucene.actions = lucene[, c(7, 9:17)]
pdfbox.actions = pdfbox[, c(7, 9:17)]
ivy.actions = ivy[, c(7, 9:17)]
# Project data for the discussion effort estimator models
derby.discussion = derby[, c(8, 9:17)]
lucene.discussion = lucene[, c(8, 9:17)]
pdfbox.discussion = pdfbox[, c(8, 9:17)]
ivy.discussion = ivy[, c(8, 9:17)]
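Selecting columns by position is fragile if the column order of the CSV files ever changes. A minimal sketch of the same selection by name, assuming the column names that appear in the printed correlation matrices (the toy frame, its identifier columns and the helper `subsetFor` are hypothetical):

```r
# Names as they appear in the correlation output of this analysis.
effort  <- c("churn", "actions", "discussion")
metrics <- c("raw_loc", "ckjm_dit", "ckjm_ca", "ckjm_npm", "ckjm_cbo",
             "ckjm_noc", "ckjm_rfc", "ckjm_lcom", "ckjm_wmc")

# Toy frame with hypothetical identifier columns and random values.
set.seed(42)
toy <- data.frame(file = paste0("F", 1:4, ".java"), release = "1.0")
for (name in c(effort, metrics)) toy[[name]] <- rpois(4, 5)

# One effort estimator followed by the nine structural complexity metrics.
subsetFor <- function(data, estimator) data[, c(estimator, metrics)]

toy.churn      <- subsetFor(toy, "churn")
toy.actions    <- subsetFor(toy, "actions")
toy.discussion <- subsetFor(toy, "discussion")
print(names(toy.churn))
```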
For the remaining three subsections the analysis is similar, given the nature of the variables. For each effort estimator, the following hypothesis will be tested:
All Project Analysis
derby.all = derby[, c(6, 7, 8, 9:17)]
lucene.all = lucene[, c(6, 7, 8, 9:17)]
pdfbox.all = pdfbox[, c(6, 7, 8, 9:17)]
ivy.all = ivy[, c(6, 7, 8, 9:17)]
# Alternative: use the means instead of the real values, which runs into the
# curse of granularity
derby.all = derby.means[, c(3:14)]
lucene.all = lucene.means[, c(3:14)]
ivy.all = ivy.means[, c(3:14)]
pdfbox.all = pdfbox.means[, c(3:14)]
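# Split each project's data by release. Note: the *.all frames now come from
# the *.means frames, while the grouping factor still comes from the original
# frames, so the two may differ in row count and row order.
derby.all.list = split(derby.all, factor(derby$release))
lucene.all.list = split(lucene.all, factor(lucene$release))
pdfbox.all.list = split(pdfbox.all, factor(pdfbox$release))
ivy.all.list = split(ivy.all, factor(ivy$release))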
## Warning: data length is not a multiple of split variable
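The warning above is raised when the data being split and the grouping factor have different lengths, which can happen here because the aggregated frame and the original frame need not have the same number of rows. A minimal sketch of one way to keep them aligned, assuming the aggregated frame still carries its own release column (`toy.means` and its values are hypothetical):

```r
# Toy aggregated frame, analogous in shape to derby.means.
toy.means <- data.frame(
  file            = c("A.java", "B.java", "A.java", "C.java"),
  release         = c("1.0", "1.0", "1.1", "1.1"),
  churn           = c(10, 3, 7, 1),
  discussion_mean = c(2.5, 1, 4, 3)
)

# Split the numeric columns using the release column of the SAME frame,
# so every row is grouped under the release it actually belongs to.
numeric.cols <- c("churn", "discussion_mean")
by.release <- split(toy.means[, numeric.cols], toy.means$release)

print(names(by.release))
print(sapply(by.release, nrow))
```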
cor(derby.all.list[[1]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.0000000 0.254256 0.282831 -0.11696 0.03694 -0.08520
## raw_loc 0.2542559 1.000000 0.008259 0.35417 0.57017 0.73710
## ckjm_dit 0.2828306 0.008259 1.000000 0.01973 -0.11481 -0.35746
## ckjm_ca -0.1169627 0.354173 0.019730 1.00000 0.75065 0.44710
## ckjm_npm 0.0369445 0.570169 -0.114814 0.75065 1.00000 0.62068
## ckjm_cbo -0.0852005 0.737098 -0.357457 0.44710 0.62068 1.00000
## ckjm_noc 0.0006234 0.163896 -0.177063 0.37648 0.37139 0.20376
## ckjm_rfc 0.2199467 0.962672 -0.020208 0.44928 0.66981 0.77360
## ckjm_lcom 0.1530250 0.738295 0.066292 0.54449 0.63755 0.56876
## ckjm_wmc 0.1939540 0.850710 0.022250 0.66296 0.82385 0.67275
## discussion_mean 0.6215001 0.251734 0.138554 0.01328 0.16360 -0.03983
## actions_mean 0.5891304 0.225210 0.304236 0.09660 0.16787 -0.06031
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.0006234 0.21995 0.15303 0.19395 0.62150
## raw_loc 0.1638955 0.96267 0.73829 0.85071 0.25173
## ckjm_dit -0.1770632 -0.02021 0.06629 0.02225 0.13855
## ckjm_ca 0.3764767 0.44928 0.54449 0.66296 0.01328
## ckjm_npm 0.3713915 0.66981 0.63755 0.82385 0.16360
## ckjm_cbo 0.2037617 0.77360 0.56876 0.67275 -0.03983
## ckjm_noc 1.0000000 0.20271 0.20823 0.30841 0.01127
## ckjm_rfc 0.2027100 1.00000 0.73811 0.89357 0.22045
## ckjm_lcom 0.2082330 0.73811 1.00000 0.82654 0.16181
## ckjm_wmc 0.3084122 0.89357 0.82654 1.00000 0.22742
## discussion_mean 0.0112738 0.22045 0.16181 0.22742 1.00000
## actions_mean 0.1067537 0.23105 0.18701 0.28342 0.53506
## actions_mean
## churn 0.58913
## raw_loc 0.22521
## ckjm_dit 0.30424
## ckjm_ca 0.09660
## ckjm_npm 0.16787
## ckjm_cbo -0.06031
## ckjm_noc 0.10675
## ckjm_rfc 0.23105
## ckjm_lcom 0.18701
## ckjm_wmc 0.28342
## discussion_mean 0.53506
## actions_mean 1.00000
cor(derby.all.list[[2]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.000000 0.21475 0.14541 0.01016 -0.004322 -0.09635
## raw_loc 0.214746 1.00000 0.08504 0.18368 0.460489 0.56543
## ckjm_dit 0.145415 0.08504 1.00000 0.18146 -0.049855 -0.23951
## ckjm_ca 0.010163 0.18368 0.18146 1.00000 0.513358 0.09717
## ckjm_npm -0.004322 0.46049 -0.04986 0.51336 1.000000 0.40405
## ckjm_cbo -0.096355 0.56543 -0.23951 0.09717 0.404050 1.00000
## ckjm_noc -0.074927 0.24025 -0.00888 0.31572 0.271653 0.08187
## ckjm_rfc 0.175661 0.94077 0.08577 0.24243 0.562374 0.65095
## ckjm_lcom 0.065729 0.66717 0.18730 0.44068 0.641307 0.41203
## ckjm_wmc 0.140878 0.82697 0.16696 0.46695 0.717843 0.46865
## discussion_mean 0.256012 0.01986 -0.10666 0.03830 -0.099919 0.01869
## actions_mean 0.390805 -0.12537 0.03322 0.05797 -0.205594 -0.22419
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn -0.07493 0.17566 0.06573 0.14088 0.25601
## raw_loc 0.24025 0.94077 0.66717 0.82697 0.01986
## ckjm_dit -0.00888 0.08577 0.18730 0.16696 -0.10666
## ckjm_ca 0.31572 0.24243 0.44068 0.46695 0.03830
## ckjm_npm 0.27165 0.56237 0.64131 0.71784 -0.09992
## ckjm_cbo 0.08187 0.65095 0.41203 0.46865 0.01869
## ckjm_noc 1.00000 0.20995 0.28632 0.32192 -0.11874
## ckjm_rfc 0.20995 1.00000 0.74068 0.87532 0.03043
## ckjm_lcom 0.28632 0.74068 1.00000 0.83400 -0.05556
## ckjm_wmc 0.32192 0.87532 0.83400 1.00000 -0.06952
## discussion_mean -0.11874 0.03043 -0.05556 -0.06952 1.00000
## actions_mean -0.15830 -0.13092 -0.19137 -0.17575 0.51677
## actions_mean
## churn 0.39081
## raw_loc -0.12537
## ckjm_dit 0.03322
## ckjm_ca 0.05797
## ckjm_npm -0.20559
## ckjm_cbo -0.22419
## ckjm_noc -0.15830
## ckjm_rfc -0.13092
## ckjm_lcom -0.19137
## ckjm_wmc -0.17575
## discussion_mean 0.51677
## actions_mean 1.00000
cor(derby.all.list[[3]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.0000000 0.15224 0.001261 -0.06511 -0.2403 0.04648
## raw_loc 0.1522361 1.00000 -0.019743 0.23238 0.5453 0.62586
## ckjm_dit 0.0012607 -0.01974 1.000000 0.15646 -0.0867 -0.26151
## ckjm_ca -0.0651142 0.23238 0.156465 1.00000 0.6041 0.30042
## ckjm_npm -0.2403042 0.54528 -0.086696 0.60414 1.0000 0.61431
## ckjm_cbo 0.0464833 0.62586 -0.261508 0.30042 0.6143 1.00000
## ckjm_noc -0.1492371 0.39826 -0.141732 0.28338 0.5026 0.41552
## ckjm_rfc 0.0787770 0.95294 -0.100620 0.27286 0.6613 0.73537
## ckjm_lcom -0.0001625 0.74187 0.095026 0.51974 0.7136 0.66605
## ckjm_wmc -0.0338199 0.82289 0.022201 0.50218 0.8122 0.68525
## discussion_mean 0.7107962 -0.04523 -0.020782 -0.11096 -0.2441 0.02669
## actions_mean 0.5104830 -0.09393 0.159289 -0.26785 -0.2966 -0.01022
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn -0.14924 0.07878 -0.0001625 -0.03382 0.71080
## raw_loc 0.39826 0.95294 0.7418728 0.82289 -0.04523
## ckjm_dit -0.14173 -0.10062 0.0950259 0.02220 -0.02078
## ckjm_ca 0.28338 0.27286 0.5197388 0.50218 -0.11096
## ckjm_npm 0.50258 0.66125 0.7135938 0.81221 -0.24413
## ckjm_cbo 0.41552 0.73537 0.6660491 0.68525 0.02669
## ckjm_noc 1.00000 0.45960 0.5689546 0.51993 -0.04646
## ckjm_rfc 0.45960 1.00000 0.7875099 0.88476 -0.07240
## ckjm_lcom 0.56895 0.78751 1.0000000 0.89737 -0.02056
## ckjm_wmc 0.51993 0.88476 0.8973724 1.00000 -0.15133
## discussion_mean -0.04646 -0.07240 -0.0205551 -0.15133 1.00000
## actions_mean -0.27255 -0.12622 -0.1495810 -0.21070 0.54021
## actions_mean
## churn 0.51048
## raw_loc -0.09393
## ckjm_dit 0.15929
## ckjm_ca -0.26785
## ckjm_npm -0.29663
## ckjm_cbo -0.01022
## ckjm_noc -0.27255
## ckjm_rfc -0.12622
## ckjm_lcom -0.14958
## ckjm_wmc -0.21070
## discussion_mean 0.54021
## actions_mean 1.00000
cor(derby.all.list[[4]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.37647 0.12690 0.27194 0.19763 0.08956
## raw_loc 0.37647 1.00000 0.01517 0.43832 0.57662 0.73423
## ckjm_dit 0.12690 0.01517 1.00000 0.04713 -0.07259 -0.30255
## ckjm_ca 0.27194 0.43832 0.04713 1.00000 0.60321 0.38881
## ckjm_npm 0.19763 0.57662 -0.07259 0.60321 1.00000 0.57356
## ckjm_cbo 0.08956 0.73423 -0.30255 0.38881 0.57356 1.00000
## ckjm_noc 0.18604 0.24846 -0.13798 0.28871 0.37368 0.27051
## ckjm_rfc 0.34912 0.91205 -0.11171 0.49527 0.68738 0.86335
## ckjm_lcom 0.25535 0.61233 0.10856 0.44214 0.59974 0.49265
## ckjm_wmc 0.41402 0.86444 0.07275 0.58261 0.76481 0.67273
## discussion_mean 0.70954 0.27146 0.16413 0.07099 -0.01229 -0.16431
## actions_mean -0.08897 -0.15997 0.24027 0.02727 -0.25225 -0.18063
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.186038 0.3491 0.2554 0.41402 0.70954
## raw_loc 0.248463 0.9120 0.6123 0.86444 0.27146
## ckjm_dit -0.137975 -0.1117 0.1086 0.07275 0.16413
## ckjm_ca 0.288705 0.4953 0.4421 0.58261 0.07099
## ckjm_npm 0.373676 0.6874 0.5997 0.76481 -0.01229
## ckjm_cbo 0.270509 0.8633 0.4926 0.67273 -0.16431
## ckjm_noc 1.000000 0.3194 0.2050 0.37597 0.14345
## ckjm_rfc 0.319355 1.0000 0.6457 0.91616 0.14180
## ckjm_lcom 0.205034 0.6457 1.0000 0.71794 0.15577
## ckjm_wmc 0.375969 0.9162 0.7179 1.00000 0.24105
## discussion_mean 0.143450 0.1418 0.1558 0.24105 1.00000
## actions_mean -0.001394 -0.1593 -0.1156 -0.11502 -0.23824
## actions_mean
## churn -0.088970
## raw_loc -0.159970
## ckjm_dit 0.240272
## ckjm_ca 0.027265
## ckjm_npm -0.252245
## ckjm_cbo -0.180635
## ckjm_noc -0.001394
## ckjm_rfc -0.159282
## ckjm_lcom -0.115608
## ckjm_wmc -0.115018
## discussion_mean -0.238242
## actions_mean 1.000000
cor(derby.all.list[[5]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.212620 0.241218 0.2141 0.29689 0.05793
## raw_loc 0.21262 1.000000 0.004096 0.4044 0.71226 0.80900
## ckjm_dit 0.24122 0.004096 1.000000 0.2215 0.09085 -0.06357
## ckjm_ca 0.21405 0.404368 0.221522 1.0000 0.58666 0.18083
## ckjm_npm 0.29689 0.712263 0.090848 0.5867 1.00000 0.45730
## ckjm_cbo 0.05793 0.809001 -0.063570 0.1808 0.45730 1.00000
## ckjm_noc -0.04234 0.313174 -0.156498 0.2034 0.38123 0.34970
## ckjm_rfc 0.22215 0.971008 0.064950 0.3868 0.73499 0.83297
## ckjm_lcom 0.31253 0.596361 0.201572 0.4600 0.79682 0.35982
## ckjm_wmc 0.31717 0.882342 0.151562 0.5348 0.90903 0.65303
## discussion_mean 0.81740 0.136541 0.056474 0.1440 0.26535 -0.02947
## actions_mean 0.43564 0.498693 0.225226 0.3107 0.37994 0.33309
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn -0.04234 0.22215 0.3125 0.3172 0.81740
## raw_loc 0.31317 0.97101 0.5964 0.8823 0.13654
## ckjm_dit -0.15650 0.06495 0.2016 0.1516 0.05647
## ckjm_ca 0.20342 0.38682 0.4600 0.5348 0.14400
## ckjm_npm 0.38123 0.73499 0.7968 0.9090 0.26535
## ckjm_cbo 0.34970 0.83297 0.3598 0.6530 -0.02947
## ckjm_noc 1.00000 0.33967 0.3737 0.3598 -0.03862
## ckjm_rfc 0.33967 1.00000 0.6355 0.9145 0.14717
## ckjm_lcom 0.37370 0.63550 1.0000 0.7943 0.26025
## ckjm_wmc 0.35977 0.91447 0.7943 1.0000 0.24043
## discussion_mean -0.03862 0.14717 0.2603 0.2404 1.00000
## actions_mean 0.02029 0.50569 0.3429 0.5081 0.30711
## actions_mean
## churn 0.43564
## raw_loc 0.49869
## ckjm_dit 0.22523
## ckjm_ca 0.31070
## ckjm_npm 0.37994
## ckjm_cbo 0.33309
## ckjm_noc 0.02029
## ckjm_rfc 0.50569
## ckjm_lcom 0.34295
## ckjm_wmc 0.50814
## discussion_mean 0.30711
## actions_mean 1.00000
cor(derby.all.list[[6]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.04068 -0.18123 -0.09533 -0.03863 0.13993
## raw_loc 0.04068 1.00000 0.02788 0.40114 0.55332 0.72140
## ckjm_dit -0.18123 0.02788 1.00000 0.40002 0.13842 -0.03303
## ckjm_ca -0.09533 0.40114 0.40002 1.00000 0.56138 0.25725
## ckjm_npm -0.03863 0.55332 0.13842 0.56138 1.00000 0.52626
## ckjm_cbo 0.13993 0.72140 -0.03303 0.25725 0.52626 1.00000
## ckjm_noc 0.13628 0.37671 -0.09934 0.34339 0.44897 0.29756
## ckjm_rfc 0.10356 0.90899 0.03722 0.43370 0.63417 0.85060
## ckjm_lcom 0.01016 0.77201 0.14412 0.56493 0.75494 0.62148
## ckjm_wmc 0.04223 0.85202 0.11542 0.57488 0.76257 0.70321
## discussion_mean 0.67969 -0.16895 0.02266 -0.14024 -0.18610 0.05077
## actions_mean 0.65640 -0.13912 -0.19284 -0.15449 -0.18120 0.03868
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.136282 0.10356 0.01016 0.04223 0.679689
## raw_loc 0.376709 0.90899 0.77201 0.85202 -0.168949
## ckjm_dit -0.099336 0.03722 0.14412 0.11542 0.022664
## ckjm_ca 0.343392 0.43370 0.56493 0.57488 -0.140239
## ckjm_npm 0.448975 0.63417 0.75494 0.76257 -0.186105
## ckjm_cbo 0.297563 0.85060 0.62148 0.70321 0.050771
## ckjm_noc 1.000000 0.40666 0.47768 0.48016 -0.000484
## ckjm_rfc 0.406661 1.00000 0.83373 0.92369 -0.101221
## ckjm_lcom 0.477679 0.83373 1.00000 0.94519 -0.182146
## ckjm_wmc 0.480157 0.92369 0.94519 1.00000 -0.188429
## discussion_mean -0.000484 -0.10122 -0.18215 -0.18843 1.000000
## actions_mean -0.063289 -0.10899 -0.14501 -0.16042 0.485178
## actions_mean
## churn 0.65640
## raw_loc -0.13912
## ckjm_dit -0.19284
## ckjm_ca -0.15449
## ckjm_npm -0.18120
## ckjm_cbo 0.03868
## ckjm_noc -0.06329
## ckjm_rfc -0.10899
## ckjm_lcom -0.14501
## ckjm_wmc -0.16042
## discussion_mean 0.48518
## actions_mean 1.00000
cor(lucene.all.list[[1]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 -0.01779 -0.02746 -0.07796 -0.12384 0.01204
## raw_loc -0.01779 1.00000 0.65722 0.63864 0.69413 0.85936
## ckjm_dit -0.02746 0.65722 1.00000 0.79861 0.44885 0.61369
## ckjm_ca -0.07796 0.63864 0.79861 1.00000 0.43759 0.57931
## ckjm_npm -0.12384 0.69413 0.44885 0.43759 1.00000 0.55044
## ckjm_cbo 0.01204 0.85936 0.61369 0.57931 0.55044 1.00000
## ckjm_noc 0.02890 0.15666 0.47376 0.56650 0.06137 0.31684
## ckjm_rfc -0.02734 0.96308 0.68829 0.62440 0.77516 0.89379
## ckjm_lcom -0.15162 0.80937 0.67958 0.74415 0.62512 0.66545
## ckjm_wmc -0.04474 0.94364 0.71369 0.70397 0.77584 0.81565
## discussion_mean 0.66407 0.02804 -0.04835 -0.09014 -0.01720 0.16206
## actions_mean 0.33773 -0.23778 -0.21412 -0.09185 -0.20546 -0.13125
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.02890 -0.02734 -0.1516 -0.04474 0.66407
## raw_loc 0.15666 0.96308 0.8094 0.94364 0.02804
## ckjm_dit 0.47376 0.68829 0.6796 0.71369 -0.04835
## ckjm_ca 0.56650 0.62440 0.7442 0.70397 -0.09014
## ckjm_npm 0.06137 0.77516 0.6251 0.77584 -0.01720
## ckjm_cbo 0.31684 0.89379 0.6655 0.81565 0.16206
## ckjm_noc 1.00000 0.16049 0.1983 0.22700 0.01291
## ckjm_rfc 0.16049 1.00000 0.7778 0.94026 0.06491
## ckjm_lcom 0.19827 0.77780 1.0000 0.84625 -0.12155
## ckjm_wmc 0.22700 0.94026 0.8462 1.00000 0.05218
## discussion_mean 0.01291 0.06491 -0.1216 0.05218 1.00000
## actions_mean -0.04588 -0.17863 -0.2566 -0.23744 0.10253
## actions_mean
## churn 0.33773
## raw_loc -0.23778
## ckjm_dit -0.21412
## ckjm_ca -0.09185
## ckjm_npm -0.20546
## ckjm_cbo -0.13125
## ckjm_noc -0.04588
## ckjm_rfc -0.17863
## ckjm_lcom -0.25664
## ckjm_wmc -0.23744
## discussion_mean 0.10253
## actions_mean 1.00000
cor(pdfbox.all.list[[1]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.000000 0.08266 -0.11297 -0.002506 0.03566 0.22045
## raw_loc 0.082659 1.00000 0.25233 0.503500 0.62121 0.65912
## ckjm_dit -0.112967 0.25233 1.00000 0.208163 0.32667 0.07423
## ckjm_ca -0.002506 0.50350 0.20816 1.000000 0.67078 0.18252
## ckjm_npm 0.035656 0.62121 0.32667 0.670776 1.00000 0.43445
## ckjm_cbo 0.220451 0.65912 0.07423 0.182521 0.43445 1.00000
## ckjm_noc 0.123168 0.29383 0.21313 0.412596 0.56535 0.12631
## ckjm_rfc 0.271648 0.82477 0.27079 0.349083 0.58778 0.83549
## ckjm_lcom 0.276944 0.63571 -0.13052 0.494029 0.47955 0.54711
## ckjm_wmc 0.078687 0.74983 0.35392 0.679891 0.96324 0.48678
## discussion_mean 0.523478 0.06732 0.04956 -0.095196 0.13650 0.34866
## actions_mean -0.147172 0.23926 0.23382 0.292148 0.08876 0.02213
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.12317 0.27165 0.2769 0.07869 0.52348
## raw_loc 0.29383 0.82477 0.6357 0.74983 0.06732
## ckjm_dit 0.21313 0.27079 -0.1305 0.35392 0.04956
## ckjm_ca 0.41260 0.34908 0.4940 0.67989 -0.09520
## ckjm_npm 0.56535 0.58778 0.4795 0.96324 0.13650
## ckjm_cbo 0.12631 0.83549 0.5471 0.48678 0.34866
## ckjm_noc 1.00000 0.40773 0.3423 0.53112 0.09682
## ckjm_rfc 0.40773 1.00000 0.5884 0.67785 0.23822
## ckjm_lcom 0.34225 0.58839 1.0000 0.55711 0.11542
## ckjm_wmc 0.53112 0.67785 0.5571 1.00000 0.13507
## discussion_mean 0.09682 0.23822 0.1154 0.13507 1.00000
## actions_mean 0.02973 0.06261 0.1848 0.18422 -0.07067
## actions_mean
## churn -0.14717
## raw_loc 0.23926
## ckjm_dit 0.23382
## ckjm_ca 0.29215
## ckjm_npm 0.08876
## ckjm_cbo 0.02213
## ckjm_noc 0.02973
## ckjm_rfc 0.06261
## ckjm_lcom 0.18484
## ckjm_wmc 0.18422
## discussion_mean -0.07067
## actions_mean 1.00000
cor(pdfbox.all.list[[2]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.000000 0.03451 0.011139 0.2088798 0.2469 0.005611
## raw_loc 0.034505 1.00000 0.460094 0.5333520 0.6101 0.689052
## ckjm_dit 0.011139 0.46009 1.000000 0.2077431 0.3912 0.423190
## ckjm_ca 0.208880 0.53335 0.207743 1.0000000 0.7324 0.373447
## ckjm_npm 0.246944 0.61009 0.391197 0.7324485 1.0000 0.670955
## ckjm_cbo 0.005611 0.68905 0.423190 0.3734470 0.6710 1.000000
## ckjm_noc 0.077062 0.12895 0.163043 0.2572310 0.2214 0.277218
## ckjm_rfc 0.096361 0.82956 0.581294 0.5186755 0.7731 0.879006
## ckjm_lcom 0.097579 0.72171 0.288556 0.5551578 0.6696 0.556320
## ckjm_wmc 0.199424 0.74139 0.576381 0.6879733 0.9551 0.764129
## discussion_mean 0.498902 0.17878 -0.007618 0.0002222 0.2147 0.156894
## actions_mean 0.204191 0.10636 -0.092386 0.0123882 0.1247 0.156123
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.07706 0.09636 0.09758 0.1994 0.4989020
## raw_loc 0.12895 0.82956 0.72171 0.7414 0.1787779
## ckjm_dit 0.16304 0.58129 0.28856 0.5764 -0.0076179
## ckjm_ca 0.25723 0.51868 0.55516 0.6880 0.0002222
## ckjm_npm 0.22141 0.77306 0.66956 0.9551 0.2147206
## ckjm_cbo 0.27722 0.87901 0.55632 0.7641 0.1568943
## ckjm_noc 1.00000 0.28467 0.23855 0.2459 -0.0683406
## ckjm_rfc 0.28467 1.00000 0.73548 0.8810 0.2291149
## ckjm_lcom 0.23855 0.73548 1.00000 0.7219 0.1956951
## ckjm_wmc 0.24593 0.88101 0.72194 1.0000 0.2235857
## discussion_mean -0.06834 0.22911 0.19570 0.2236 1.0000000
## actions_mean 0.02417 0.15784 0.07452 0.1246 0.4573630
## actions_mean
## churn 0.20419
## raw_loc 0.10636
## ckjm_dit -0.09239
## ckjm_ca 0.01239
## ckjm_npm 0.12474
## ckjm_cbo 0.15612
## ckjm_noc 0.02417
## ckjm_rfc 0.15784
## ckjm_lcom 0.07452
## ckjm_wmc 0.12456
## discussion_mean 0.45736
## actions_mean 1.00000
cor(pdfbox.all.list[[3]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.3075 0.13900 0.061350 0.11308 0.13370
## raw_loc 0.30748 1.0000 0.32322 0.365720 0.40616 0.63237
## ckjm_dit 0.13900 0.3232 1.00000 0.055127 0.10399 0.20465
## ckjm_ca 0.06135 0.3657 0.05513 1.000000 0.52272 0.09458
## ckjm_npm 0.11308 0.4062 0.10399 0.522722 1.00000 0.28181
## ckjm_cbo 0.13370 0.6324 0.20465 0.094576 0.28181 1.00000
## ckjm_noc 0.25668 0.1540 0.04316 0.433677 0.45411 0.15967
## ckjm_rfc 0.15534 0.7798 0.33889 0.216167 0.45783 0.88306
## ckjm_lcom 0.19973 0.5558 0.16441 0.217444 0.30211 0.42848
## ckjm_wmc 0.17238 0.6751 0.30955 0.465545 0.88590 0.48141
## discussion_mean -0.07297 0.1888 0.28107 0.043735 -0.05105 -0.01361
## actions_mean 0.01439 0.2057 0.21095 -0.009989 -0.08642 0.09200
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.25668 0.1553 0.19973 0.17238 -0.07297
## raw_loc 0.15400 0.7798 0.55582 0.67511 0.18884
## ckjm_dit 0.04316 0.3389 0.16441 0.30955 0.28107
## ckjm_ca 0.43368 0.2162 0.21744 0.46554 0.04373
## ckjm_npm 0.45411 0.4578 0.30211 0.88590 -0.05105
## ckjm_cbo 0.15967 0.8831 0.42848 0.48141 -0.01361
## ckjm_noc 1.00000 0.2361 0.29716 0.36315 -0.19878
## ckjm_rfc 0.23609 1.0000 0.54911 0.70724 0.10807
## ckjm_lcom 0.29716 0.5491 1.00000 0.48857 0.08021
## ckjm_wmc 0.36315 0.7072 0.48857 1.00000 0.06814
## discussion_mean -0.19878 0.1081 0.08021 0.06814 1.00000
## actions_mean -0.16068 0.1306 0.13876 0.01883 0.60793
## actions_mean
## churn 0.014387
## raw_loc 0.205691
## ckjm_dit 0.210948
## ckjm_ca -0.009989
## ckjm_npm -0.086417
## ckjm_cbo 0.092001
## ckjm_noc -0.160680
## ckjm_rfc 0.130572
## ckjm_lcom 0.138761
## ckjm_wmc 0.018830
## discussion_mean 0.607926
## actions_mean 1.000000
cor(pdfbox.all.list[[4]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 0.1215 -0.357074 0.15689 0.2996 0.213176
## raw_loc 0.12150 1.0000 0.314058 0.69346 0.5806 0.509925
## ckjm_dit -0.35707 0.3141 1.000000 0.05843 0.0846 0.006989
## ckjm_ca 0.15689 0.6935 0.058430 1.00000 0.8085 0.330029
## ckjm_npm 0.29958 0.5806 0.084599 0.80854 1.0000 0.449054
## ckjm_cbo 0.21318 0.5099 0.006989 0.33003 0.4491 1.000000
## ckjm_noc -0.04650 0.2515 0.080080 0.32735 0.3621 0.246030
## ckjm_rfc 0.15481 0.7931 0.201890 0.52023 0.5878 0.773068
## ckjm_lcom 0.03189 0.7808 0.166679 0.78342 0.7349 0.435924
## ckjm_wmc 0.11058 0.7596 0.291369 0.78276 0.8809 0.454229
## discussion_mean 0.77301 0.2366 -0.452609 0.31814 0.4328 0.131094
## actions_mean -0.51955 -0.2441 0.349664 -0.32921 -0.4422 -0.379397
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn -0.04650 0.1548 0.03189 0.1106 0.7730
## raw_loc 0.25152 0.7931 0.78077 0.7596 0.2366
## ckjm_dit 0.08008 0.2019 0.16668 0.2914 -0.4526
## ckjm_ca 0.32735 0.5202 0.78342 0.7828 0.3181
## ckjm_npm 0.36212 0.5878 0.73493 0.8809 0.4328
## ckjm_cbo 0.24603 0.7731 0.43592 0.4542 0.1311
## ckjm_noc 1.00000 0.2587 0.39938 0.3642 -0.1139
## ckjm_rfc 0.25870 1.0000 0.73494 0.7220 0.2110
## ckjm_lcom 0.39938 0.7349 1.00000 0.9122 0.1690
## ckjm_wmc 0.36415 0.7220 0.91225 1.0000 0.2361
## discussion_mean -0.11393 0.2110 0.16900 0.2361 1.0000
## actions_mean -0.20291 -0.2926 -0.38453 -0.3656 -0.3560
## actions_mean
## churn -0.5195
## raw_loc -0.2441
## ckjm_dit 0.3497
## ckjm_ca -0.3292
## ckjm_npm -0.4422
## ckjm_cbo -0.3794
## ckjm_noc -0.2029
## ckjm_rfc -0.2926
## ckjm_lcom -0.3845
## ckjm_wmc -0.3656
## discussion_mean -0.3560
## actions_mean 1.0000
cor(ivy.all.list[[1]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.000000 -0.03073 -0.0504024 -0.10119 0.01100 -0.04262
## raw_loc -0.030734 1.00000 0.4372613 0.36212 0.61813 0.74911
## ckjm_dit -0.050402 0.43726 1.0000000 0.59081 0.28687 0.48172
## ckjm_ca -0.101187 0.36212 0.5908138 1.00000 0.46860 0.43568
## ckjm_npm 0.010997 0.61813 0.2868664 0.46860 1.00000 0.48880
## ckjm_cbo -0.042620 0.74911 0.4817189 0.43568 0.48880 1.00000
## ckjm_noc -0.017535 -0.06029 -0.1151528 0.04982 0.13419 0.02978
## ckjm_rfc -0.028410 0.95017 0.4892025 0.41187 0.63135 0.85834
## ckjm_lcom -0.006379 0.61776 0.2187122 0.46319 0.87908 0.51944
## ckjm_wmc -0.014363 0.74784 0.3734588 0.49677 0.95031 0.64037
## discussion_mean 0.582550 -0.12404 -0.1236095 -0.07788 -0.02660 -0.12553
## actions_mean 0.194687 -0.09717 -0.0005875 0.03507 -0.06197 -0.03752
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn -0.01753 -0.02841 -0.006379 -0.01436 0.58255
## raw_loc -0.06029 0.95017 0.617761 0.74784 -0.12404
## ckjm_dit -0.11515 0.48920 0.218712 0.37346 -0.12361
## ckjm_ca 0.04982 0.41187 0.463193 0.49677 -0.07788
## ckjm_npm 0.13419 0.63135 0.879078 0.95031 -0.02660
## ckjm_cbo 0.02978 0.85834 0.519436 0.64037 -0.12553
## ckjm_noc 1.00000 -0.01115 0.170395 0.13733 0.06504
## ckjm_rfc -0.01115 1.00000 0.620444 0.77330 -0.12276
## ckjm_lcom 0.17040 0.62044 1.000000 0.89317 -0.06813
## ckjm_wmc 0.13733 0.77330 0.893167 1.00000 -0.06515
## discussion_mean 0.06504 -0.12276 -0.068133 -0.06515 1.00000
## actions_mean -0.06446 -0.05240 -0.106281 -0.07153 0.19486
## actions_mean
## churn 0.1946873
## raw_loc -0.0971703
## ckjm_dit -0.0005875
## ckjm_ca 0.0350730
## ckjm_npm -0.0619714
## ckjm_cbo -0.0375173
## ckjm_noc -0.0644574
## ckjm_rfc -0.0524048
## ckjm_lcom -0.1062805
## ckjm_wmc -0.0715348
## discussion_mean 0.1948607
## actions_mean 1.0000000
cor(ivy.all.list[[2]], method = "spearman")
## churn raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## churn 1.00000 -0.03704 0.24069 -0.03393 0.070096 0.07017
## raw_loc -0.03704 1.00000 0.38094 0.25006 0.632097 0.83681
## ckjm_dit 0.24069 0.38094 1.00000 0.46648 0.425585 0.38095
## ckjm_ca -0.03393 0.25006 0.46648 1.00000 0.501459 0.21847
## ckjm_npm 0.07010 0.63210 0.42558 0.50146 1.000000 0.48137
## ckjm_cbo 0.07017 0.83681 0.38095 0.21847 0.481374 1.00000
## ckjm_noc 0.18937 -0.05288 0.09644 -0.02595 0.005631 -0.21175
## ckjm_rfc 0.05595 0.93951 0.38279 0.16620 0.572420 0.91050
## ckjm_lcom 0.08840 0.66431 0.19367 0.50458 0.826130 0.55818
## ckjm_wmc 0.06277 0.77963 0.46949 0.49039 0.916560 0.65624
## discussion_mean 0.66106 -0.25453 -0.03955 -0.26531 -0.241694 -0.23307
## actions_mean 0.30466 -0.14218 -0.09492 -0.14315 0.084735 -0.21359
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc discussion_mean
## churn 0.189370 0.05595 0.08840 0.062767 0.66106
## raw_loc -0.052881 0.93951 0.66431 0.779626 -0.25453
## ckjm_dit 0.096442 0.38279 0.19367 0.469487 -0.03955
## ckjm_ca -0.025952 0.16620 0.50458 0.490393 -0.26531
## ckjm_npm 0.005631 0.57242 0.82613 0.916560 -0.24169
## ckjm_cbo -0.211746 0.91050 0.55818 0.656244 -0.23307
## ckjm_noc 1.000000 -0.13338 -0.07160 0.013338 0.14938
## ckjm_rfc -0.133375 1.00000 0.62128 0.741185 -0.19643
## ckjm_lcom -0.071597 0.62128 1.00000 0.858996 -0.18734
## ckjm_wmc 0.013338 0.74118 0.85900 1.000000 -0.23708
## discussion_mean 0.149377 -0.19643 -0.18734 -0.237078 1.00000
## actions_mean 0.168377 -0.14903 0.04066 -0.006774 0.25335
## actions_mean
## churn 0.304665
## raw_loc -0.142185
## ckjm_dit -0.094923
## ckjm_ca -0.143152
## ckjm_npm 0.084735
## ckjm_cbo -0.213590
## ckjm_noc 0.168377
## ckjm_rfc -0.149028
## ckjm_lcom 0.040664
## ckjm_wmc -0.006774
## discussion_mean 0.253353
## actions_mean 1.000000
Save every release as a CSV file in the folder so that it can be used in Weka or other tools.
library(Hmisc)
## Loading required package: survival
## Loading required package: splines
## Hmisc library by Frank E Harrell Jr
##
## Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview') to see overall
## documentation.
##
## NOTE:Hmisc no longer redefines [.factor to drop unused levels when
## subsetting. To get the old behavior of Hmisc type dropUnusedLevels().
## Attaching package: 'Hmisc'
## The following object(s) are masked from 'package:survival':
##
## untangle.specials
## The following object(s) are masked from 'package:plyr':
##
## is.discrete, summarize
## The following object(s) are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
#
# setwd('~/Dropbox/Academia/Hawaii/Carlos_Thesis_Papers/Thesis/Chapters/scripts/weka_data/discussion_variable_is_interval/pdfbox')
# Generate one csv per release whose name contains the number of data points
# and the release, so that a cross-sectional classification can be performed.
# for (i in 1:length(pdfbox.all.list)) {
#   # Create a list of dataframes where each dataframe contains the data
#   # points of a given release.
#   # Dichotomize discussion into intervals of same frequency
#   pdfbox.all.list[[i]]$discussion = cut2(pdfbox.all.list[[i]]$discussion)
#   write.csv(pdfbox.all.list[[i]],
#     paste0('pdfbox_cross_sectional_n', nrow(pdfbox.all.list[[i]]), '_', pdfbox.all.list[[i]]$release[1], '.csv'))
# }
plot(derby.all.list[[1]])
plot(derby.all.list[[2]])
plot(derby.all.list[[3]])
plot(derby.all.list[[4]])
plot(derby.all.list[[5]])
plot(derby.all.list[[6]])
plot(lucene.all.list[[1]])
plot(pdfbox.all.list[[1]])
plot(pdfbox.all.list[[2]])
plot(pdfbox.all.list[[3]])
plot(pdfbox.all.list[[4]])
plot(ivy.all.list[[1]])
plot(ivy.all.list[[2]])
### Discussion Effort Estimator Analysis
The first thing we must do is obtain the training and test sets from the filtered releases. The data frames derby.discussion, lucene.discussion, pdfbox.discussion and ivy.discussion each contain the data of all releases of their project. Let's break them down per release:
```r
# Create a list of dataframes (tables) where each dataframe contains
# datapoints of a given release for each project.
derby.discussion.list = split(derby.discussion, factor(derby$release))
lucene.discussion.list = split(lucene.discussion, factor(lucene$release))
pdfbox.discussion.list = split(pdfbox.discussion, factor(pdfbox$release))
ivy.discussion.list = split(ivy.discussion, factor(ivy$release))
```
Since the order of the data frames is the order in which the releases occurred, the first position of each project's list contains the first release of that project, the second position contains the second release, and so on. This leaves us with 6 releases of Derby, 1 of Lucene, 4 of Pdfbox and 2 of Ivy. Notice that this variation is also influenced by the mapping between files and issues, along with several other threats to validity reported in the paper.
Let's start by observing some characteristics of our datasets. For discussion, we are interested in whether there is any inflation of zeros (many issues with zero discussions) and, if so, how it changes over time.
suppressMessages(library(ggplot2))
qplot(discussion, data = derby, facets = release ~ ., geom = "histogram", binwidth = 1,
main = "Derby")
We can observe that, across the distributions of the 6 releases, only the fourth (10.5.3.0) and fifth (10.6.1.0) releases show an inflation of zeros in discussion. This might affect how well a model trained on 10.1.1.0 behaves when trying to predict releases such as the fourth and fifth. We can also observe that, overall, most discussion values fall in the range between 0 and 20.
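The visual impression of zero inflation can also be checked numerically. The following is a minimal sketch, assuming a data frame with a release column and a discussion column like the ones used above; the toy data frame and the name zero_share are invented for illustration and are not part of the original script.

```r
# Toy stand-in for the project data; the values are invented for illustration.
toy <- data.frame(
  release    = c("r1", "r1", "r1", "r1", "r2", "r2", "r2", "r2"),
  discussion = c(0, 3, 7, 12, 0, 0, 0, 5)
)

# Share of observations with zero discussion, computed per release.
zero_share <- aggregate(discussion ~ release, data = toy,
                        FUN = function(x) mean(x == 0))
names(zero_share)[2] <- "share_zero"
print(zero_share)
```

Replacing toy with derby (or any of the other projects) should give one share per release, which can then be compared against the histograms.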
The following plots display the distribution for the remaining projects (lucene, pdfbox, and ivy).
qplot(discussion, data = lucene, facets = release ~ ., geom = "histogram", binwidth = 1,
main = "Lucene")
qplot(discussion, data = pdfbox, facets = release ~ ., geom = "histogram", binwidth = 1,
main = "Pdfbox")
qplot(discussion, data = ivy, facets = release ~ ., geom = "histogram", binwidth = 1,
main = "Ivy")
We can see that an inflation of zeros is not common in the discussion distributions (again, beware that the missing zero-discussion observations may be related to the issue-mapping threat to validity).
For the sake of this analysis, I will use the first Derby release as the training set of the model. The remaining releases will be considered test datasets for testing the hypothesis.
derby.discussion.train = derby.discussion.list[[1]]
head(derby.discussion.train)
## discussion raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo ckjm_noc ckjm_rfc
## 7 15 23 2 1 3 0 0 6
## 41 15 1373 0 11 71 19 1 267
## 52 15 2998 1 13 178 20 1 429
## 53 15 2998 1 13 178 20 1 429
## 54 7 2998 1 13 178 20 1 429
## 70 12 1563 1 16 76 20 1 251
## ckjm_lcom ckjm_wmc
## 7 0 3
## 41 450 109
## 52 3560 268
## 53 3560 268
## 54 3560 268
## 70 2946 129
Lastly, we can also observe how the distribution of each structural complexity file metric varies over time. Since I am only concerned with the first hypothesis test at this point, that is, intra-project analysis, I defer further comparisons among projects until the appropriate hypothesis test.
a <- qplot(release, discussion, data = derby, geom = "boxplot")
b <- qplot(release, raw_loc, data = derby, geom = "boxplot")
c <- qplot(release, ckjm_dit, data = derby, geom = "boxplot")
d <- qplot(release, ckjm_ca, data = derby, geom = "boxplot")
e <- qplot(release, ckjm_npm, data = derby, geom = "boxplot")
f <- qplot(release, ckjm_cbo, data = derby, geom = "boxplot")
g <- qplot(release, ckjm_noc, data = derby, geom = "boxplot")
h <- qplot(release, ckjm_rfc, data = derby, geom = "boxplot")
i <- qplot(release, ckjm_lcom, data = derby, geom = "boxplot")
j <- qplot(release, ckjm_wmc, data = derby, geom = "boxplot")
library(grid)
grid.newpage()
pushViewport(viewport(layout = grid.layout(10, 1)))
print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1), main = "Derby")
print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(c, vp = viewport(layout.pos.row = 3, layout.pos.col = 1))
print(d, vp = viewport(layout.pos.row = 4, layout.pos.col = 1))
print(e, vp = viewport(layout.pos.row = 5, layout.pos.col = 1))
print(f, vp = viewport(layout.pos.row = 6, layout.pos.col = 1))
print(g, vp = viewport(layout.pos.row = 7, layout.pos.col = 1))
print(h, vp = viewport(layout.pos.row = 8, layout.pos.col = 1))
print(i, vp = viewport(layout.pos.row = 9, layout.pos.col = 1))
print(j, vp = viewport(layout.pos.row = 10, layout.pos.col = 1))
We can see that there is a great dispersion of discussion across the releases, while the same does not occur with the structural complexity file metrics.
The script in this section performs an analysis of the data based almost entirely on this journal.
We need to observe which of our predictors are correlated. Strongly correlated predictors should not be used together in the same model.
cor(derby.discussion.train, method = "spearman")
## discussion raw_loc ckjm_dit ckjm_ca ckjm_npm ckjm_cbo
## discussion 1.00000 0.251734 0.138554 0.01328 0.1636 -0.03983
## raw_loc 0.25173 1.000000 0.008259 0.35417 0.5702 0.73710
## ckjm_dit 0.13855 0.008259 1.000000 0.01973 -0.1148 -0.35746
## ckjm_ca 0.01328 0.354173 0.019730 1.00000 0.7506 0.44710
## ckjm_npm 0.16360 0.570169 -0.114814 0.75065 1.0000 0.62068
## ckjm_cbo -0.03983 0.737098 -0.357457 0.44710 0.6207 1.00000
## ckjm_noc 0.01127 0.163896 -0.177063 0.37648 0.3714 0.20376
## ckjm_rfc 0.22045 0.962672 -0.020208 0.44928 0.6698 0.77360
## ckjm_lcom 0.16181 0.738295 0.066292 0.54449 0.6375 0.56876
## ckjm_wmc 0.22742 0.850710 0.022250 0.66296 0.8238 0.67275
## ckjm_noc ckjm_rfc ckjm_lcom ckjm_wmc
## discussion 0.01127 0.22045 0.16181 0.22742
## raw_loc 0.16390 0.96267 0.73829 0.85071
## ckjm_dit -0.17706 -0.02021 0.06629 0.02225
## ckjm_ca 0.37648 0.44928 0.54449 0.66296
## ckjm_npm 0.37139 0.66981 0.63755 0.82385
## ckjm_cbo 0.20376 0.77360 0.56876 0.67275
## ckjm_noc 1.00000 0.20271 0.20823 0.30841
## ckjm_rfc 0.20271 1.00000 0.73811 0.89357
## ckjm_lcom 0.20823 0.73811 1.00000 0.82654
## ckjm_wmc 0.30841 0.89357 0.82654 1.00000
From this list, after discarding every combination of predictors that contains correlated predictors, we are left with the following possible predictor combinations:
Model 1: ['dit', 'ca', 'rawloc', 'noc']
Model 2: ['dit', 'npm', 'rawloc', 'noc']
Model 3: ['dit', 'rfc', 'ca', 'noc']
Model 4: ['dit', 'npm', 'rfc', 'noc']
Model 5: ['dit', 'ca', 'lcom', 'cbo', 'noc']
Model 6: ['dit', 'wmc', 'ca', 'cbo', 'noc']
Model 7: ['dit', 'npm', 'lcom', 'cbo', 'noc']
We now fit the Poisson models.
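The filtering step described above could be sketched as follows. This is only an illustration under stated assumptions: the text does not say which correlation cutoff was used, so the 0.7 threshold is hypothetical, and the predictors loc, rfc, dit and noc are simulated here rather than taken from the project data. The sketch enumerates every subset of predictors, keeps those in which no pair exceeds the cutoff, and then keeps only the largest such sets.

```r
# Simulated predictors; the data and the 0.7 cutoff are assumptions.
set.seed(1)
n <- 200
size <- rnorm(n)
toy <- data.frame(
  loc = size + rnorm(n, sd = 0.1),  # built to be strongly correlated with rfc
  rfc = size + rnorm(n, sd = 0.1),
  dit = rnorm(n),
  noc = rnorm(n)
)
threshold <- 0.7

rho <- cor(toy, method = "spearman")

# TRUE when no pair inside the candidate set exceeds the cutoff.
is_admissible <- function(vars) {
  if (length(vars) < 2) return(TRUE)
  block <- abs(rho[vars, vars])
  all(block[upper.tri(block)] <= threshold)
}

# Enumerate every non-empty subset and keep the admissible ones.
vars <- colnames(rho)
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
admissible <- Filter(is_admissible, subsets)

# Keep only maximal sets, i.e. those not contained in a larger admissible set.
is_maximal <- function(s) {
  !any(vapply(admissible,
              function(other) length(other) > length(s) && all(s %in% other),
              logical(1)))
}
maximal <- Filter(is_maximal, admissible)
print(maximal)
```

With real metrics, rho would come from a call such as the cor(derby.discussion.train, method = "spearman") shown earlier, with the response column dropped before enumerating subsets.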
model1 <- glm(discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, data = derby.discussion.train,
family = poisson)
model2 <- glm(discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, data = derby.discussion.train,
family = poisson)
model3 <- glm(discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, data = derby.discussion.train,
family = poisson)
model4 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, data = derby.discussion.train,
family = poisson)
model5 <- glm(discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + ckjm_noc,
data = derby.discussion.train, family = poisson)
model6 <- glm(discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + ckjm_noc,
data = derby.discussion.train, family = poisson)
model7 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + ckjm_cbo + ckjm_noc,
data = derby.discussion.train, family = poisson)
# model1 model2 model3 model4 model5 model6
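Because the models are trained on one release and meant to be tested on the others, it may help to see how such a cross-release evaluation could look. The sketch below uses simulated releases; make_release, the coefficients and the choice of mean absolute error as the score are assumptions made for illustration, not the evaluation used in the paper.

```r
# Simulated "releases"; the generator and its coefficients are invented.
set.seed(3)
make_release <- function(n) {
  loc <- rpois(n, 50)
  data.frame(raw_loc = loc,
             discussion = rpois(n, exp(0.5 + 0.05 * loc)))
}
train <- make_release(150)  # stands in for the training release
test  <- make_release(80)   # stands in for a later release

# Fit on the training release only.
fit <- glm(discussion ~ raw_loc, data = train, family = poisson)

# Predicted mean counts for the unseen release.
pred <- predict(fit, newdata = test, type = "response")

# Mean absolute error of the model, and of a baseline that always
# predicts the mean discussion of the training release.
mae          <- mean(abs(test$discussion - pred))
baseline_mae <- mean(abs(test$discussion - mean(train$discussion)))
print(c(model = mae, baseline = baseline_mae))
```

Comparing against a trivial baseline gives a rough sense of whether the structural metrics add predictive value on a release the model has not seen.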
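Zero-inflated models
Plain Poisson models (znmodel1 to znmodel7) are fitted on the fourth Derby release first, followed by zero-inflated models (zmodel1 to zmodel7) on the same release.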
library(pscl)
## Loading required package: MASS
## Loading required package: mvtnorm
## Loading required package: coda
## Loading required package: lattice
## Loading required package: gam
## Loaded gam 1.06.2
## Loading required package: vcd
## Loading required package: colorspace
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
znmodel1 <- glm(discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, data = derby.discussion.list[[4]],
family = poisson)
znmodel2 <- glm(discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, data = derby.discussion.list[[4]],
family = poisson)
znmodel3 <- glm(discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, data = derby.discussion.list[[4]],
family = poisson)
znmodel4 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc, data = derby.discussion.list[[4]],
family = poisson)
znmodel5 <- glm(discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo + ckjm_noc,
data = derby.discussion.list[[4]], family = poisson)
znmodel6 <- glm(discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo + ckjm_noc,
data = derby.discussion.list[[4]], family = poisson)
znmodel7 <- glm(discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + ckjm_cbo + ckjm_noc,
data = derby.discussion.list[[4]], family = poisson)
summary(znmodel1)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc,
## family = poisson, data = derby.discussion.list[[4]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.231 -4.420 0.577 1.645 5.548
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.34e+00 4.22e-02 55.34 < 2e-16 ***
## ckjm_dit 4.47e-02 3.23e-02 1.38 0.166
## ckjm_ca -4.43e-03 2.49e-03 -1.78 0.075 .
## raw_loc 1.59e-04 2.18e-05 7.30 2.9e-13 ***
## ckjm_noc -9.57e-03 1.49e-02 -0.64 0.520
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1098.1 on 105 degrees of freedom
## Residual deviance: 1017.7 on 101 degrees of freedom
## AIC: 1372
##
## Number of Fisher Scoring iterations: 6
summary(znmodel2)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc,
## family = poisson, data = derby.discussion.list[[4]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.204 -4.478 0.499 1.636 5.610
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.30e+00 4.21e-02 54.55 < 2e-16 ***
## ckjm_dit 6.42e-02 3.91e-02 1.64 0.10
## ckjm_npm 4.12e-04 1.07e-03 0.39 0.70
## raw_loc 1.38e-04 3.26e-05 4.24 2.2e-05 ***
## ckjm_noc -1.37e-02 1.50e-02 -0.91 0.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1098.1 on 105 degrees of freedom
## Residual deviance: 1020.8 on 101 degrees of freedom
## AIC: 1376
##
## Number of Fisher Scoring iterations: 5
summary(znmodel3)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc,
## family = poisson, data = derby.discussion.list[[4]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.552 -4.243 0.533 1.640 5.662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.305351 0.042940 53.69 < 2e-16 ***
## ckjm_dit 0.046529 0.032220 1.44 0.149
## ckjm_rfc 0.001467 0.000212 6.93 4.3e-12 ***
## ckjm_ca -0.006462 0.002612 -2.47 0.013 *
## ckjm_noc -0.014122 0.015265 -0.93 0.355
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1098.1 on 105 degrees of freedom
## Residual deviance: 1017.7 on 101 degrees of freedom
## AIC: 1372
##
## Number of Fisher Scoring iterations: 6
summary(znmodel4)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc,
## family = poisson, data = derby.discussion.list[[4]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.575 -4.411 0.471 1.699 5.686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.259906 0.041988 53.82 < 2e-16 ***
## ckjm_dit 0.082820 0.037511 2.21 0.02725 *
## ckjm_npm 0.000736 0.001054 0.70 0.48527
## ckjm_rfc 0.001102 0.000287 3.84 0.00012 ***
## ckjm_noc -0.018940 0.015360 -1.23 0.21754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1098.1 on 105 degrees of freedom
## Residual deviance: 1023.7 on 101 degrees of freedom
## AIC: 1378
##
## Number of Fisher Scoring iterations: 5
summary(znmodel5)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo +
## ckjm_noc, family = poisson, data = derby.discussion.list[[4]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.807 -4.202 0.681 1.400 5.854
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.49e+00 5.04e-02 49.33 < 2e-16 ***
## ckjm_dit 1.38e-01 3.00e-02 4.61 4.0e-06 ***
## ckjm_ca 2.23e-04 2.53e-03 0.09 0.93
## ckjm_lcom 4.39e-05 9.76e-06 4.50 6.7e-06 ***
## ckjm_cbo -7.78e-03 1.79e-03 -4.34 1.4e-05 ***
## ckjm_noc -6.69e-03 1.46e-02 -0.46 0.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1098.1 on 105 degrees of freedom
## Residual deviance: 1038.8 on 100 degrees of freedom
## AIC: 1396
##
## Number of Fisher Scoring iterations: 6
summary(znmodel6)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo +
## ckjm_noc, family = poisson, data = derby.discussion.list[[4]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.28 -3.89 0.61 1.60 5.23
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.465674 0.048245 51.11 < 2e-16 ***
## ckjm_dit 0.017124 0.031349 0.55 0.585
## ckjm_wmc 0.006643 0.000462 14.38 < 2e-16 ***
## ckjm_ca -0.006287 0.002793 -2.25 0.024 *
## ckjm_cbo -0.012407 0.001608 -7.72 1.2e-14 ***
## ckjm_noc -0.018571 0.016443 -1.13 0.259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1098.13 on 105 degrees of freedom
## Residual deviance: 901.86 on 100 degrees of freedom
## AIC: 1259
##
## Number of Fisher Scoring iterations: 6
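The summaries above report residual deviances of roughly 900 to 1040 on about 100 degrees of freedom. For a Poisson model, a ratio far above 1 is commonly read as a sign of overdispersion, which makes the reported standard errors too optimistic. A minimal sketch of that check on simulated counts follows; the data, the variable names and the quasi-Poisson refit are assumptions for illustration and are not part of the original analysis.

```r
# Simulated overdispersed counts; names and parameters are illustrative.
set.seed(42)
n <- 300
x <- rnorm(n)
mu <- exp(1 + 0.3 * x)
# Mixing the Poisson mean with gamma noise makes the variance exceed the mean.
y <- rpois(n, lambda = mu * rgamma(n, shape = 1, rate = 1))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion estimate: values well above 1 point to overdispersion.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)

# A quasi-Poisson fit keeps the coefficients but rescales the standard errors
# by the estimated dispersion.
fit_q <- glm(y ~ x, family = quasipoisson)
print(summary(fit_q)$dispersion)
```

The same two lines that compute dispersion could be applied to znmodel1 through znmodel7 to quantify how far each fit departs from the Poisson assumption.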
zmodel1 <- zeroinfl(discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc, data = derby.discussion.list[[4]])
zmodel2 <- zeroinfl(discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc, data = derby.discussion.list[[4]])
zmodel3 <- zeroinfl(discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc, data = derby.discussion.list[[4]])
zmodel4 <- zeroinfl(discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc,
data = derby.discussion.list[[4]])
zmodel5 <- zeroinfl(discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo +
ckjm_noc, data = derby.discussion.list[[4]])
## Error: system is computationally singular: reciprocal condition number =
## 5.91079e-20
zmodel6 <- zeroinfl(discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo +
ckjm_noc, data = derby.discussion.list[[4]])
zmodel7 <- zeroinfl(discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom + ckjm_cbo +
ckjm_noc, data = derby.discussion.list[[4]])
## Error: system is computationally singular: reciprocal condition number =
## 6.44213e-20
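Both failing models include ckjm_lcom, whose values reach the thousands in the training data shown earlier, while ckjm_dit stays in single digits. One plausible contributor to a "computationally singular" error is poor numerical conditioning caused by predictors on very different scales, and standardizing the predictors is a common remedy to try. The sketch below only illustrates the conditioning effect with base R on invented data; whether it resolves the zeroinfl errors above is untested here.

```r
# Invented predictors on very different scales; not the project data.
set.seed(7)
n <- 100
toy <- data.frame(
  dit  = rpois(n, 2),     # single digits
  lcom = rpois(n, 3000)   # thousands
)

# Condition number of the design matrix (intercept included).
X_raw <- model.matrix(~ dit + lcom, data = toy)
kappa_raw <- kappa(X_raw, exact = TRUE)

# Same predictors after centering and scaling to unit variance.
toy_std <- as.data.frame(scale(toy))
X_std <- model.matrix(~ dit + lcom, data = toy_std)
kappa_std <- kappa(X_std, exact = TRUE)

print(c(raw = kappa_raw, standardized = kappa_std))
```

A much smaller condition number after scaling suggests that refitting with standardized predictors is worth trying, keeping in mind that the coefficients are then expressed per standard deviation of each metric.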
summary(zmodel1)
## Warning: NaNs produced
##
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc,
## data = derby.discussion.list[[4]])
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.912 -1.040 0.221 0.641 3.507
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.68e+00 4.19e-02 63.93 <2e-16 ***
## ckjm_dit 1.25e-01 5.11e-03 24.44 <2e-16 ***
## ckjm_ca -3.48e-03 2.30e-03 -1.51 0.130
## raw_loc 3.67e-05 NA NA NA
## ckjm_noc 3.78e-02 1.87e-02 2.02 0.044 *
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.646447 0.383328 -1.69 0.09172 .
## ckjm_dit 0.450426 0.420617 1.07 0.28423
## ckjm_ca 0.014622 0.021734 0.67 0.50109
## raw_loc -0.001349 0.000389 -3.47 0.00052 ***
## ckjm_noc 0.086919 0.098703 0.88 0.37853
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -346 on 10 Df
summary(zmodel2)
## Warning: NaNs produced
##
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_npm + raw_loc +
## ckjm_noc, data = derby.discussion.list[[4]])
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.680 -1.029 0.202 0.712 4.073
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.63e+00 4.10e-02 64.11 < 2e-16 ***
## ckjm_dit 1.83e-01 2.30e-02 7.96 1.7e-15 ***
## ckjm_npm 2.11e-03 3.19e-04 6.62 3.7e-11 ***
## raw_loc -2.59e-05 NA NA NA
## ckjm_noc 2.69e-02 1.93e-02 1.40 0.16
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.621536 0.388106 -1.60 0.1093
## ckjm_dit 0.451425 0.421888 1.07 0.2846
## ckjm_npm 0.011387 0.012837 0.89 0.3751
## raw_loc -0.001577 0.000502 -3.14 0.0017 **
## ckjm_noc 0.084227 0.096575 0.87 0.3831
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 17
## Log-likelihood: -345 on 10 Df
summary(zmodel3)
##
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca +
## ckjm_noc, data = derby.discussion.list[[4]])
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.751 -1.127 0.239 0.791 3.314
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.67720 0.04319 61.99 < 2e-16 ***
## ckjm_dit 0.12855 0.03388 3.79 0.00015 ***
## ckjm_rfc 0.00030 0.00023 1.30 0.19201
## ckjm_ca -0.00382 0.00267 -1.43 0.15306
## ckjm_noc 0.03578 0.01942 1.84 0.06540 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.67123 0.38221 -1.76 0.079 .
## ckjm_dit 0.29959 0.37253 0.80 0.421
## ckjm_rfc -0.00643 0.00269 -2.39 0.017 *
## ckjm_ca 0.01267 0.02133 0.59 0.553
## ckjm_noc 0.10202 0.10017 1.02 0.308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 20
## Log-likelihood: -348 on 10 Df
summary(zmodel4)
##
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc +
## ckjm_noc, data = derby.discussion.list[[4]])
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.291 -1.140 0.162 0.780 3.690
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.636200 0.042979 61.34 < 2e-16 ***
## ckjm_dit 0.188912 0.037783 5.00 5.7e-07 ***
## ckjm_npm 0.002331 0.001051 2.22 0.027 *
## ckjm_rfc -0.000309 0.000281 -1.10 0.272
## ckjm_noc 0.028987 0.019706 1.47 0.141
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.64110 0.38410 -1.67 0.095 .
## ckjm_dit 0.30869 0.38738 0.80 0.426
## ckjm_npm 0.00722 0.01262 0.57 0.567
## ckjm_rfc -0.00718 0.00353 -2.03 0.042 *
## ckjm_noc 0.10482 0.09859 1.06 0.288
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -346 on 10 Df
summary(zmodel5)
## Error: object 'zmodel5' not found
summary(zmodel6)
##
## Call:
## zeroinfl(formula = discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca +
## ckjm_cbo + ckjm_noc, data = derby.discussion.list[[4]])
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -4.266 -0.983 0.169 0.771 3.117
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.791085 0.047186 59.15 < 2e-16 ***
## ckjm_dit 0.061084 0.031167 1.96 0.05 .
## ckjm_wmc 0.004278 0.000519 8.25 < 2e-16 ***
## ckjm_ca -0.003350 0.002862 -1.17 0.24
## ckjm_cbo -0.011555 0.001635 -7.07 1.6e-12 ***
## ckjm_noc 0.035429 0.020205 1.75 0.08 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0113 0.4247 -2.38 0.0173 *
## ckjm_dit 0.5876 0.3857 1.52 0.1277
## ckjm_wmc -0.0282 0.0106 -2.67 0.0077 **
## ckjm_ca 0.0232 0.0227 1.02 0.3076
## ckjm_cbo 0.0173 0.0146 1.18 0.2362
## ckjm_noc 0.1102 0.1046 1.05 0.2917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 20
## Log-likelihood: -309 on 12 Df
vuong(znmodel1, zmodel1)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.134
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 4.878e-13
vuong(znmodel2, zmodel2)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.14
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 4.681e-13
vuong(znmodel3, zmodel3)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.136
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 4.821e-13
vuong(znmodel4, zmodel4)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.168
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 3.81e-13
vuong(znmodel5, zmodel5)
## Error: object 'zmodel5' not found
vuong(znmodel6, zmodel6)
## Vuong Non-Nested Hypothesis Test-Statistic: -7.059
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## in this case:
## model2 > model1, with p-value 8.399e-13
We now analyze each of the 7 Poisson models fitted on the training set (derby.discussion.train).
summary(model1)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + raw_loc + ckjm_noc,
## family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.216 -1.295 -0.906 1.060 4.872
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.77e+00 5.91e-02 29.92 < 2e-16 ***
## ckjm_dit 1.64e-02 4.36e-02 0.38 0.71
## ckjm_ca 3.60e-03 3.11e-03 1.16 0.25
## raw_loc 1.67e-04 3.61e-05 4.64 3.6e-06 ***
## ckjm_noc -3.26e-02 2.15e-02 -1.51 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 349.63 on 92 degrees of freedom
## AIC: 701.4
##
## Number of Fisher Scoring iterations: 5
In model 1, only the intercept and raw_loc are statistically significant predictors of discussion (both at p < 0.001).
summary(model2)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + raw_loc + ckjm_noc,
## family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.914 -1.204 -0.894 0.458 4.814
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.75e+00 5.87e-02 29.82 < 2e-16 ***
## ckjm_dit 2.49e-02 4.44e-02 0.56 0.57432
## ckjm_npm 3.68e-03 1.09e-03 3.38 0.00073 ***
## raw_loc 9.76e-05 4.52e-05 2.16 0.03077 *
## ckjm_noc -3.21e-02 2.07e-02 -1.55 0.12056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 339.85 on 92 degrees of freedom
## AIC: 691.6
##
## Number of Fisher Scoring iterations: 5
In model 2, the intercept and ckjm_npm are significant at p < 0.001, and raw_loc is significant at p < 0.05.
summary(model3)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_rfc + ckjm_ca + ckjm_noc,
## family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.189 -1.253 -0.826 0.814 4.754
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.695639 0.062733 27.03 < 2e-16 ***
## ckjm_dit -0.001297 0.043907 -0.03 0.98
## ckjm_rfc 0.001906 0.000318 5.99 2.1e-09 ***
## ckjm_ca 0.000325 0.003306 0.10 0.92
## ckjm_noc -0.025712 0.022172 -1.16 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 335.77 on 92 degrees of freedom
## AIC: 687.5
##
## Number of Fisher Scoring iterations: 5
In model 3, only the intercept and ckjm_rfc are significant (both at p < 0.001).
summary(model4)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_rfc + ckjm_noc,
## family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.076 -1.312 -0.828 0.403 4.652
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.699445 0.062321 27.27 < 2e-16 ***
## ckjm_dit 0.007855 0.044602 0.18 0.86021
## ckjm_npm 0.001890 0.001285 1.47 0.14128
## ckjm_rfc 0.001471 0.000434 3.39 0.00069 ***
## ckjm_noc -0.029658 0.020574 -1.44 0.14942
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 333.61 on 92 degrees of freedom
## AIC: 685.4
##
## Number of Fisher Scoring iterations: 5
In model 4, only the intercept and ckjm_rfc are significant (both at p < 0.001).
summary(model5)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_ca + ckjm_lcom + ckjm_cbo +
## ckjm_noc, family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.29 -1.38 -1.02 1.14 5.03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.86e+00 7.34e-02 25.38 <2e-16 ***
## ckjm_dit 1.55e-02 4.63e-02 0.34 0.737
## ckjm_ca 4.37e-03 3.59e-03 1.22 0.224
## ckjm_lcom 5.37e-05 2.27e-05 2.36 0.018 *
## ckjm_cbo -2.87e-04 3.04e-03 -0.09 0.925
## ckjm_noc -4.18e-02 2.21e-02 -1.89 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 362.75 on 91 degrees of freedom
## AIC: 716.5
##
## Number of Fisher Scoring iterations: 5
In model 5, the intercept is significant at p < 0.001 and ckjm_lcom at p < 0.05; ckjm_noc is only marginal (p = 0.058).
summary(model6)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_wmc + ckjm_ca + ckjm_cbo +
## ckjm_noc, family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.842 -1.310 -0.891 0.488 4.898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.796151 0.073749 24.35 < 2e-16 ***
## ckjm_dit -0.014735 0.047375 -0.31 0.756
## ckjm_wmc 0.003704 0.000612 6.05 1.4e-09 ***
## ckjm_ca 0.002774 0.003688 0.75 0.452
## ckjm_cbo -0.001730 0.002880 -0.60 0.548
## ckjm_noc -0.042953 0.022904 -1.88 0.061 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 336.88 on 91 degrees of freedom
## AIC: 690.6
##
## Number of Fisher Scoring iterations: 5
In model 6, only the intercept and ckjm_wmc are significant (both at p < 0.001); ckjm_noc is only marginal (p = 0.061).
summary(model7)
##
## Call:
## glm(formula = discussion ~ ckjm_dit + ckjm_npm + ckjm_lcom +
## ckjm_cbo + ckjm_noc, family = poisson, data = derby.discussion.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.988 -1.318 -0.908 0.933 4.654
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.76e+00 7.82e-02 22.53 <2e-16 ***
## ckjm_dit 4.87e-02 4.81e-02 1.01 0.311
## ckjm_npm 5.28e-03 1.11e-03 4.76 2e-06 ***
## ckjm_lcom -1.09e-05 2.86e-05 -0.38 0.702
## ckjm_cbo 1.38e-03 2.76e-03 0.50 0.619
## ckjm_noc -3.65e-02 2.07e-02 -1.76 0.079 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 377.90 on 96 degrees of freedom
## Residual deviance: 343.88 on 91 degrees of freedom
## AIC: 697.6
##
## Number of Fisher Scoring iterations: 5
In model 7, only the intercept and ckjm_npm are significant (both at p < 0.001); ckjm_noc is only marginal (p = 0.079).
Concretely, keeping only the significant metrics reduces the 7 models to the following 6 (models 3 and 4 both reduce to rfc alone):
Model 1: [raw_loc]
Model 2: [npm, raw_loc]
Model 3: [rfc]
Model 4: [lcom]
Model 5: [wmc]
Model 6: [npm]
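The reduction step can also be done programmatically by filtering the coefficient table of a fitted model. A minimal sketch on synthetic data; the names x1, x2, y and the threshold alpha = 0.05 are assumptions chosen for illustration, not values taken from the original analysis:

```r
# Synthetic data; the column names x1, x2 and y are illustrative only.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.8 * x1))  # y depends on x1 only

full <- glm(y ~ x1 + x2, family = poisson)

# Coefficient table: one row per term, p-values in the "Pr(>|z|)" column.
coefs <- summary(full)$coefficients
pvals <- coefs[, "Pr(>|z|)"]

# Terms (excluding the intercept) that are significant at the chosen level.
alpha <- 0.05
keep  <- setdiff(names(pvals)[pvals < alpha], "(Intercept)")

# Refit with only the retained predictors.
reduced <- glm(reformulate(keep, response = "y"), family = poisson)
```

Selecting predictors by p-value and then refitting on the same data tends to overstate significance, so the reduced models are best judged on held-out data.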
fmodel1 <- glm(discussion ~ raw_loc, data = derby.discussion.train, family = poisson)
fmodel2 <- glm(discussion ~ ckjm_npm + raw_loc, data = derby.discussion.train,
family = poisson)
fmodel3 <- glm(discussion ~ ckjm_rfc, data = derby.discussion.train, family = poisson)
fmodel4 <- glm(discussion ~ ckjm_lcom, data = derby.discussion.train, family = poisson)
fmodel5 <- glm(discussion ~ ckjm_wmc, data = derby.discussion.train, family = poisson)
fmodel6 <- glm(discussion ~ ckjm_npm, data = derby.discussion.train, family = poisson)
fmodel1
##
## Call: glm(formula = discussion ~ raw_loc, family = poisson, data = derby.discussion.train)
##
## Coefficients:
## (Intercept) raw_loc
## 1.769612 0.000182
##
## Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
## Null Deviance: 378
## Residual Deviance: 353 AIC: 698
fmodel2
##
## Call: glm(formula = discussion ~ ckjm_npm + raw_loc, family = poisson,
## data = derby.discussion.train)
##
## Coefficients:
## (Intercept) ckjm_npm raw_loc
## 1.732417 0.003338 0.000113
##
## Degrees of Freedom: 96 Total (i.e. Null); 94 Residual
## Null Deviance: 378
## Residual Deviance: 343 AIC: 691
fmodel3
##
## Call: glm(formula = discussion ~ ckjm_rfc, family = poisson, data = derby.discussion.train)
##
## Coefficients:
## (Intercept) ckjm_rfc
## 1.67257 0.00193
##
## Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
## Null Deviance: 378
## Residual Deviance: 338 AIC: 683
fmodel4
##
## Call: glm(formula = discussion ~ ckjm_lcom, family = poisson, data = derby.discussion.train)
##
## Coefficients:
## (Intercept) ckjm_lcom
## 1.86e+00 6.24e-05
##
## Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
## Null Deviance: 378
## Residual Deviance: 367 AIC: 713
fmodel5
##
## Call: glm(formula = discussion ~ ckjm_wmc, family = poisson, data = derby.discussion.train)
##
## Coefficients:
## (Intercept) ckjm_wmc
## 1.74622 0.00363
##
## Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
## Null Deviance: 378
## Residual Deviance: 341 AIC: 687
fmodel6
##
## Call: glm(formula = discussion ~ ckjm_npm, family = poisson, data = derby.discussion.train)
##
## Coefficients:
## (Intercept) ckjm_npm
## 1.781 0.005
##
## Degrees of Freedom: 96 Total (i.e. Null); 95 Residual
## Null Deviance: 378
## Residual Deviance: 349 AIC: 695
Furthermore, the AIC values of the 7 full models are as follows:
c(model1$aic, model2$aic, model3$aic, model4$aic, model5$aic, model6$aic, model7$aic)
## [1] 701.4 691.6 687.5 685.4 716.5 690.6 697.6
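The AIC values printed above can be ranked directly. A minimal sketch that copies the seven reported values into a named vector (the names model1 to model7 simply follow the order of the c() call above):

```r
# AIC values copied from the output above, in model order.
aic <- c(model1 = 701.4, model2 = 691.6, model3 = 687.5, model4 = 685.4,
         model5 = 716.5, model6 = 690.6, model7 = 697.6)

best  <- names(which.min(aic))  # model with the smallest AIC
delta <- sort(aic - min(aic))   # distance of each model from the best one
best
delta
```

Sorting the differences rather than the raw values makes it easier to see how close the runners-up are to the best model.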
According to AIC, the best of the 7 full models is model 4 (AIC 685.4), in which ckjm_rfc is the only significant metric; the model containing LCOM (model 5) has the highest AIC (716.5). We now test how well the 6 reduced Poisson models generalize, using error functions from the Metrics library on our Derby test dataset, which was randomly selected to be the fifth element of derby.discussion.list. See the Metrics documentation for details of the function implementation.
suppressWarnings(suppressMessages(library("Metrics")))
derby.discussion.test = derby.discussion.list[[5]]
fmodel1.e = rmse(predict(fmodel1, data.frame(raw_loc = derby.discussion.test$raw_loc)),
derby.discussion.test$discussion)
fmodel2.e = rmse(predict(fmodel2, data.frame(ckjm_npm = derby.discussion.test$ckjm_npm,
raw_loc = derby.discussion.test$raw_loc)), derby.discussion.test$discussion)
fmodel3.e = rmse(predict(fmodel3, data.frame(ckjm_rfc = derby.discussion.test$ckjm_rfc)),
derby.discussion.test$discussion)
fmodel4.e = rmse(predict(fmodel4, data.frame(ckjm_lcom = derby.discussion.test$ckjm_lcom)),
derby.discussion.test$discussion)
fmodel5.e = rmse(predict(fmodel5, data.frame(ckjm_wmc = derby.discussion.test$ckjm_wmc)),
derby.discussion.test$discussion)
fmodel6.e = rmse(predict(fmodel6, data.frame(ckjm_npm = derby.discussion.test$ckjm_npm)),
derby.discussion.test$discussion)
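The RMSE of the 6 reduced models is given below. Because predict() is called above without type = "response", the predictions are on the log (link) scale while discussion is a raw count, so these values should be read with that mismatch in mind: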
c(fmodel1.e, fmodel2.e, fmodel3.e, fmodel4.e, fmodel5.e, fmodel6.e)
## [1] 7.199 7.203 7.186 7.172 7.185 7.210
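For a Poisson glm, predict() returns values on the log (link) scale unless type = "response" is given. A minimal sketch on synthetic data of how the two scales differ when computing RMSE; the data, the names x, y, fit and the helper rmse_base() are assumptions made for illustration and do not reproduce the Derby results:

```r
# Synthetic data; x and y are illustrative names, not columns of the Derby data.
set.seed(42)
x <- runif(300, 0, 2)
y <- rpois(300, lambda = exp(1 + 0.7 * x))
train <- data.frame(x = x[1:200],   y = y[1:200])
test  <- data.frame(x = x[201:300], y = y[201:300])

fit <- glm(y ~ x, data = train, family = poisson)

# Root mean squared error, written out in base R.
rmse_base <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

pred_link     <- predict(fit, newdata = test)                     # log scale (default)
pred_response <- predict(fit, newdata = test, type = "response")  # count scale

rmse_link     <- rmse_base(test$y, pred_link)
rmse_response <- rmse_base(test$y, pred_response)
```

Here exp(pred_link) equals pred_response, so only the response-scale error is measured in the same units as the observed counts.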
We now plot the scatterplot matrix of every release subset:
plot(derby.discussion.list[[1]])
plot(derby.discussion.list[[2]])
plot(derby.discussion.list[[3]])
plot(derby.discussion.list[[4]])
plot(derby.discussion.list[[5]])
plot(derby.discussion.list[[6]])
plot(lucene.discussion.list[[1]])
plot(pdfbox.discussion.list[[1]])
plot(pdfbox.discussion.list[[2]])
plot(pdfbox.discussion.list[[3]])
plot(pdfbox.discussion.list[[4]])
plot(ivy.discussion.list[[1]])
plot(ivy.discussion.list[[2]])
Since most of the reduced models have only a single variable, let's plot one of them (fmodel1, on raw_loc):
plot(derby.discussion.list[[2]]$raw_loc, derby.discussion.list[[2]]$discussion,
pch = 19, col = "darkgrey", xlab = "Raw LOC", ylab = "Discussion")
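# Sort by raw_loc so that lines() traces the fitted curve from left to right.
ord <- order(derby.discussion.list[[2]]$raw_loc)
lines(derby.discussion.list[[2]]$raw_loc[ord], predict(fmodel1, data.frame(raw_loc = derby.discussion.list[[2]]$raw_loc[ord]),
type = "response"), col = "red", lwd = 3)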
predict(fmodel1, data.frame(raw_loc = derby.discussion.list[[2]]$raw_loc), type = "response")
## 1 2 3 4 5 6 7 8 9 10
## 6.081 5.909 7.190 5.910 7.275 7.275 6.827 7.663 6.366 6.186
## 11 12 13 14 15 16 17 18 19 20
## 5.905 5.922 7.538 7.538 10.147 10.147 10.147 10.147 10.147 10.147
## 21 22 23 24 25 26 27 28 29 30
## 10.147 5.989 5.885 6.068 7.844 7.844 7.844 6.152 5.998 6.557
## 31 32 33 34 35 36 37 38 39 40
## 7.217 5.921 5.889 13.639 13.639 13.639 13.639 13.639 6.175 6.952
## 41 42 43 44 45 46 47 48 49 50
## 6.751 6.522 6.522 5.900 6.050 5.986 6.266 5.926 6.113 5.905
## 51 52 53 54 55 56 57 58 59 60
## 5.905 6.071 6.075 5.924 6.268 6.005 6.503 6.425 7.014 6.597
## 61 62 63 64 65 66 67 68 69 70
## 7.664 7.664 7.664 6.559 6.559 6.160 5.983 6.477 6.260 7.466
## 71 72 73 74 75 76 77 78 79 80
## 6.051 6.532 6.532 6.285 6.179 6.179 6.179 15.681 15.681 6.285
## 81 82 83 84 85 86 87 88 89 90
## 6.084 6.043 6.090 6.386 6.162 6.518 9.241 6.826 6.237 6.359
## 91 92 93 94 95 96 97 98 99 100
## 7.146 8.820 6.795 6.795 8.795 7.032 5.994 6.156 6.452 6.452
## 101 102 103 104 105 106 107 108 109 110
## 7.825 6.082 6.078 6.006 6.052 5.973 5.924 5.936 6.919 5.989
## 111 112 113 114 115 116 117 118 119 120
## 5.989 5.913 6.002 6.469 6.469 6.219 6.089 6.540 6.093 6.905
## 121 122 123 124 125 126 127 128 129 130
## 7.629 6.255 7.919 6.602 6.602 6.577 6.448 6.491 6.416 5.917
## 131 132 133 134 135 136 137 138 139 140
## 6.246 5.927 5.960 7.159 7.159 7.159 6.379 8.222 8.222 6.251
## 141 142 143 144 145 146 147 148 149 150
## 5.903 6.523 6.476 6.299 10.688 7.391 6.095 7.349 6.099 6.099
## 151 152 153 154 155 156 157 158 159 160
## 6.585 5.901 6.483 7.256 6.485 6.540 5.894 5.894 6.366 6.366
## 161 162
## 6.484 6.554
derby.discussion.list[[2]]$discussion
## [1] 4 20 11 20 4 15 2 4 4 4 4 15 6 15 14 9 5 6 3 15 4 4 4
## [24] 4 4 6 15 4 4 6 4 2 2 4 7 11 3 46 46 46 15 4 1 1 3 3
## [47] 2 13 13 6 5 21 21 1 4 1 8 1 4 3 14 6 5 6 5 3 3 3 4
## [70] 13 13 6 5 1 4 1 6 21 1 21 21 21 4 1 7 3 4 21 21 21 4 12
## [93] 4 12 21 12 21 12 21 3 21 21 21 21 21 21 21 21 21 6 5 21 21 4 3
## [116] 21 21 1 1 1 7 2 3 5 7 4 3 1 8 15 15 20 20 6 15 20 20 15
## [139] 20 15 15 5 46 12 12 8 12 15 2 4 2 1 46 4 4 4 3 3 15 3 15
## [162] 37
My conclusion is that we cannot establish a relation between the structural complexity metrics and the effort estimator discussion. This follows mostly from the 6 Derby scatterplot matrices, which show how the discussion estimator behaves against each of the file metrics. Furthermore, the statistical models do not suggest that any combination of the file metrics would help establish a relation between structural complexity and discussion. Lastly, this might be due to the way we distribute issue discussion across files. Concretely, since we repeat the discussion value for each file that was submitted in a patch, we see many repeated values of discussion for the same file metric. Any prior relation between the amount of discussion and the associated file may thus be influencing the observed relation.
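The repetition described above can be inspected by counting how many files share each issue's discussion value. A minimal sketch on toy rows; the values are invented for illustration, and averaging raw_loc per issue is only one possible aggregation, not the method used in this analysis:

```r
# Toy rows in the shape described in the text: one row per (file, issue_code),
# with the issue's discussion count copied onto every file in the patch.
obs <- data.frame(
  file       = c("A.java", "B.java", "C.java", "D.java", "E.java"),
  issue_code = c("I-1", "I-1", "I-1", "I-2", "I-3"),
  raw_loc    = c(120, 800, 45, 300, 60),
  discussion = c(21, 21, 21, 4, 7)
)

# Number of files that share each issue's discussion value.
files_per_issue <- table(obs$issue_code)

# One row per issue: discussion is constant within an issue, raw_loc is averaged.
per_issue <- aggregate(cbind(raw_loc, discussion) ~ issue_code, data = obs, FUN = mean)
```

In this toy case a single issue contributes three of the five rows, which is the kind of duplication that could dominate a per-file regression.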
# Create a list of dataframes (tables) where each dataframe contains
# datapoints of a given release for each project.
derby.actions.list = split(derby.discussion, factor(derby$release))
lucene.actions.list = split(lucene.discussion, factor(lucene$release))
pdfbox.actions.list = split(pdfbox.discussion, factor(pdfbox$release))
ivy.actions.list = split(ivy.discussion, factor(ivy$release))
Let's look at all the plots for actions:
plot(derby.actions.list[[1]])